# Data Engineering in Spark – II 
© Explore Data Science Academy

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Processing%20Big%20Data/spark.png"
     alt="Dummy image 1"
     style="float: center; padding-bottom: 0.5em"
     />

</div>

## Learning objectives

In this train, you learn how to implement common data engineering transformations. These include: 

  - enrichment and imputation;
  - indexing and ordering; 
  - anonymisation and encryption;
  - modeling, typecasting, formatting, and renaming; and 
  - pivoting

## Introduction

In this train, we continue our work on common transformations in Apache Spark. In the previous train, we characterised the dataset and identified its flaws and shortcomings. Here, we adjust the dataset to correct those flaws, and introduce more advanced techniques to assist with the necessary transformations. Although this does not cover the full scope of what you can achieve with Spark, it should ensure that you understand the fundamentals. Transforming data is a very detail-orientated task, so a thorough understanding of the code snippets throughout this train will go a long way towards preparing you for your data engineering journey.

Let's start by reading in the dataset that we previously saved as parquet:

In [32]:
# Import Spark and some auxiliary functions, and set up a SparkSession.

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, FloatType, TimestampType

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In [6]:
working_df = spark.read.parquet('./data/weather_data/hrly_Irish_weather/')

Also define categorical and continuous columns:

In [30]:
weather_cat = ['county', 'station', 'date']
weather_int = ['rhum', 'wdsp', 'wddir', 'vis', 'clht', 'clamt']

# Create a list of continuous fields (fields that are not in the categorical list above).
weather_cont = [x for x in working_df.columns if x not in weather_cat]

# Create a list of float fields (those in the above list, but not in the integer list).
weather_float = [x for x in weather_cont if x not in weather_int] 


## Data enrichment and imputation

After we have characterised and summarised the incoming dataset, we can try to improve and enrich the dataset using some data engineering techniques. 

Missing data can be addressed by infilling or imputation. 

The dataset can also be enriched by creating new fields, which can be precomputed metrics (think percentages, means, or some windowed function), or cleaning up the current features (think splitting text-dense columns). 

### Enriching data

We'll start enriching the dataset by adding new inferred fields. 

To enhance this dataset, let's try to infer the associated season based on the temperature, and then validate the season using the date.

There are a couple of things that we have to do to perform this operation:

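1. Create day and month columns to window over.
2. Get the **minimum** and **maximum** temperature for each day of the year.
3. Get the **mean** of those daily minimum and maximum temperatures per month.
4. Rank the months by temperature and assign the hottest and coldest months to summer and winter.
5. Assign the remaining months to spring and autumn based on the order of the months.
6. Validate against the seasons derived directly from the calendar month.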

In [7]:
# Create day (day of the year) and month columns.
working_df = working_df.withColumn('day', F.dayofyear('date'))
working_df = working_df.withColumn('month', F.month('date'))

In [8]:
from pyspark.sql import Window
# Create daily and monthly windows using Window class and partitionBy() method,
# using the 'day' and 'month' columns respectively.
day_window = Window.partitionBy(F.col('day'))
month_window = Window.partitionBy(F.col('month'))

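# Branch off under a new name. DataFrames are immutable, so no explicit copy is needed.
working_df_season = working_df

# Get the minimum and maximum temperature for each day of the year using the windows defined above.
# Note: each window pools every station and every year that shares the same day of the year.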
working_df_season = working_df_season.withColumn('temp_daily_max', F.max('temp').over(day_window))
working_df_season = working_df_season.withColumn('temp_daily_min', F.min('temp').over(day_window))

# Get the mean for the minimum and maximum temperatures 
# using the mean() function with the monthly window.
working_df_season = working_df_season.withColumn('temp_monthly_min', F.mean('temp_daily_min').over(month_window))
working_df_season = working_df_season.withColumn('temp_monthly_max', F.mean('temp_daily_max').over(month_window))

To estimate the season, we want to compare our calculated minimum and maximum values against historic values. We retrieve these values from [Wikipedia](https://en.wikipedia.org/wiki/Climate_of_Ireland).

In [9]:
# Climatic data from Wikipedia.
rough_min_max_ireland = {
    1: {'min': 2, 'max': 8},
    2: {'min': 2, 'max': 8},
    3: {'min': 3, 'max': 10},
    4: {'min': 5, 'max': 12},
    5: {'min': 7, 'max': 15},
    6: {'min': 10, 'max': 18},
    7: {'min': 12, 'max': 20},
    8: {'min': 12, 'max': 19},
    9: {'min': 10, 'max': 17},
    10: {'min': 7, 'max': 14},
    11: {'min': 5, 'max': 10},
    12: {'min': 3, 'max': 5}
}

Let's group by month and get the mean values for each month, to compare against the values retrieved from Wikipedia.

In [10]:
working_df_season.groupBy('month').mean().select('month', 'avg(temp_monthly_min)', 'avg(temp_monthly_max)').orderBy('month').show()

+-----+---------------------+---------------------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|
+-----+---------------------+---------------------+
|    1|   -6.294776098179843|   13.973935970171693|
|    2|    -5.83057417469433|   14.518216912065954|
|    3|   -4.845702851176519|   17.019559528027298|
|    4|  -2.9175169930454037|    20.41644317842727|
|    5|  -0.8002945385715458|   24.354988555758272|
|    6|  -0.4465843441516472|    26.21454659678375|
|    7|                  0.0|    27.30916489960508|
|    8| -0.00323508493013298|   26.269253851997522|
|    9|  -0.2034015738605942|   23.701991028867827|
|   10|  -1.9310375362803789|   19.284097329660977|
|   11|   -4.177552947793544|    15.99675852985561|
|   12|   -9.814020929424249|   14.417645506544467|
+-----+---------------------+---------------------+



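Compared with the values from Wikipedia, ours are clearly more extreme, with minimum values lower than expected and maximum values higher. A likely reason is that each daily window pools all stations and all years, so we are averaging the most extreme readings recorded for each day of the year rather than the range of a typical day. 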

But still, it seems like the highest maximum temperatures are centred around summer and the lowest around winter. Let's see if we can use this to infer the seasons. 

In [11]:
# Create a DataFrame that we can manipulate, similar to the above DataFrame, ordered by temperature.

seasons_df = working_df_season.groupBy('month').mean().select('month', 'avg(temp_monthly_min)', 'avg(temp_monthly_max)').orderBy('avg(temp_monthly_max)')

In [12]:
seasons_df.show()

+-----+---------------------+---------------------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|
+-----+---------------------+---------------------+
|    1|   -6.294776098179843|   13.973935970171693|
|   12|   -9.814020929424249|   14.417645506544467|
|    2|    -5.83057417469433|   14.518216912065954|
|   11|   -4.177552947793544|    15.99675852985561|
|    3|   -4.845702851176519|   17.019559528027298|
|   10|  -1.9310375362803789|   19.284097329660977|
|    4|  -2.9175169930454037|    20.41644317842727|
|    9|  -0.2034015738605942|   23.701991028867827|
|    5|  -0.8002945385715458|   24.354988555758272|
|    6|  -0.4465843441516472|    26.21454659678375|
|    8| -0.00323508493013298|   26.269253851997522|
|    7|                  0.0|    27.30916489960508|
+-----+---------------------+---------------------+



In [13]:
# Create a window that is ordered by the maximum temperature.
wndw = Window.orderBy('avg(temp_monthly_max)')

# Create a rank over the window defined above.
seasons_df = seasons_df.withColumn('temp_rank', F.dense_rank().over(wndw))

seasons_df.show()

+-----+---------------------+---------------------+---------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|temp_rank|
+-----+---------------------+---------------------+---------+
|    1|   -6.294776098179843|   13.973935970171693|        1|
|   12|   -9.814020929424249|   14.417645506544467|        2|
|    2|    -5.83057417469433|   14.518216912065954|        3|
|   11|   -4.177552947793544|    15.99675852985561|        4|
|    3|   -4.845702851176519|   17.019559528027298|        5|
|   10|  -1.9310375362803789|   19.284097329660977|        6|
|    4|  -2.9175169930454037|    20.41644317842727|        7|
|    9|  -0.2034015738605942|   23.701991028867827|        8|
|    5|  -0.8002945385715458|   24.354988555758272|        9|
|    6|  -0.4465843441516472|    26.21454659678375|       10|
|    8| -0.00323508493013298|   26.269253851997522|       11|
|    7|                  0.0|    27.30916489960508|       12|
+-----+---------------------+---------------------+---------+



Now we have a DataFrame that is ranked by maximum temperature. Using this, we can assign the hottest and coldest months to summer and winter. We write a simple Python function that labels the three lowest-ranked months as winter and the three highest-ranked months as summer. 

In [14]:
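def summer_or_winter(temp_rank):
    # The three coldest months (lowest ranks) are winter; the three hottest are summer.
    if temp_rank in [1, 2, 3]:
        return 'winter'
    if temp_rank in [10, 11, 12]:
        return 'summer'
    # All other months stay unassigned for now (null in Spark).
    return None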

In [15]:
# Convert the Python function to a Spark UDF that returns a string.

udf_season = F.udf(summer_or_winter, StringType())

In [16]:
seasons_df = seasons_df.withColumn('season', udf_season(F.col('temp_rank')))

seasons_df.orderBy('month').show()

+-----+---------------------+---------------------+---------+------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|temp_rank|season|
+-----+---------------------+---------------------+---------+------+
|    1|   -6.294776098179843|   13.973935970171693|        1|winter|
|    2|    -5.83057417469433|   14.518216912065954|        3|winter|
|    3|   -4.845702851176519|   17.019559528027298|        5|  null|
|    4|  -2.9175169930454037|    20.41644317842727|        7|  null|
|    5|  -0.8002945385715458|   24.354988555758272|        9|  null|
|    6|  -0.4465843441516472|    26.21454659678375|       10|summer|
|    7|                  0.0|    27.30916489960508|       12|summer|
|    8| -0.00323508493013298|   26.269253851997522|       11|summer|
|    9|  -0.2034015738605942|   23.701991028867827|        8|  null|
|   10|  -1.9310375362803789|   19.284097329660977|        6|  null|
|   11|   -4.177552947793544|    15.99675852985561|        4|  null|
|   12|   -9.814020929424249|   14.417645506544467|        2|winter|
+-----+---------------------+---------------------+---------+------+

Great! Now we have assignments for summer and winter. Autumn and spring are a little bit more difficult since they will inherently have overlapping temperatures. Let's use the order of the months this time to assign the months to the seasons. 

For this, we will use a simple function that compares the month number with June (month 6), the first summer month, and assigns spring to unlabelled months before it and autumn to unlabelled months after it. 


In [17]:
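def autumn_or_spring(month, season):
    # Months already labelled summer or winter keep their label.
    if season:
        return season
    # Unlabelled months before June (month 6) are spring; those after it are autumn.
    if month < 6:
        return 'spring'
    if month > 6:
        return 'autumn'
    return None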

In [18]:
# Convert the Python function to a Spark UDF that returns a string.

udf_season = F.udf(autumn_or_spring, StringType())

In [19]:
seasons_df = seasons_df.withColumn('season', udf_season(F.col('month'), F.col('season')))

seasons_df.orderBy('month').show()

+-----+---------------------+---------------------+---------+------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|temp_rank|season|
+-----+---------------------+---------------------+---------+------+
|    1|   -6.294776098179843|   13.973935970171693|        1|winter|
|    2|    -5.83057417469433|   14.518216912065954|        3|winter|
|    3|   -4.845702851176519|   17.019559528027298|        5|spring|
|    4|  -2.9175169930454037|    20.41644317842727|        7|spring|
|    5|  -0.8002945385715458|   24.354988555758272|        9|spring|
|    6|  -0.4465843441516472|    26.21454659678375|       10|summer|
|    7|                  0.0|    27.30916489960508|       12|summer|
|    8| -0.00323508493013298|   26.269253851997522|       11|summer|
|    9|  -0.2034015738605942|   23.701991028867827|        8|autumn|
|   10|  -1.9310375362803789|   19.284097329660977|        6|autumn|
|   11|   -4.177552947793544|    15.99675852985561|        4|autumn|
|   12|   -9.814020929424249|   14.417645506544467|        2|winter|
+-----+---------------------+---------------------+---------+------+

There we go! All the seasons are assigned.

Let's compare that with seasons purely derived from what we know about the months of the year.

In [20]:
season_dict = {
    1: 'winter',
    2: 'winter',
    3: 'spring',
    4: 'spring',
    5: 'spring',
    6: 'summer',
    7: 'summer',
    8: 'summer',
    9: 'autumn',
    10: 'autumn',
    11: 'autumn',
    12: 'winter'    
}

In [21]:
from itertools import chain
# Create a map type to map the above dictionary to the DataFrame. 
# The create_map() function requires a list of columns to convert into a map type.
# We convert each item in the above dictionary into a column using the lit() function.
# Using list comprehension and the chain() function, we create a list of all items.
mapping_expr = F.create_map([F.lit(x) for x in chain(*season_dict.items())])

# We apply the map type to the month column.
seasons_df = seasons_df.withColumn("season_truth", mapping_expr.getItem(seasons_df["month"]))

Let's run through what just happened. Within the list comprehension, we first flatten the dictionary into a single sequence of alternating keys and values using the `chain()` function from `itertools`: `1, 'winter', 2, 'winter', ...`. The `lit()` function then converts each item in that sequence into a literal column object in Spark. 
Finally, the [`create_map()`](https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.create_map) function reads those columns pairwise as keys and values and builds a map type in Spark, which can be used to look up a value for each row based on the value in another column. 

We then use the map to transform the month column into seasons.
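The flattening step can be tried in plain Python, without Spark. This is a minimal sketch that uses a shortened copy of the dictionary purely for illustration:

```python
from itertools import chain

# A shortened copy of the month-to-season dictionary, for illustration only.
season_dict = {1: 'winter', 3: 'spring', 6: 'summer', 9: 'autumn'}

# items() yields (key, value) tuples; chain(*...) flattens them into one sequence.
flat = list(chain(*season_dict.items()))
print(flat)  # [1, 'winter', 3, 'spring', 6, 'summer', 9, 'autumn']

# A map is built from such a sequence pairwise: key, value, key, value, ...
pairs = list(zip(flat[0::2], flat[1::2]))
print(pairs)  # [(1, 'winter'), (3, 'spring'), (6, 'summer'), (9, 'autumn')]
```

Wrapping each element of `flat` in `F.lit()` gives the list of columns that `create_map()` expects.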


In [22]:
seasons_df.show()

+-----+---------------------+---------------------+---------+------+------------+
|month|avg(temp_monthly_min)|avg(temp_monthly_max)|temp_rank|season|season_truth|
+-----+---------------------+---------------------+---------+------+------------+
|    1|   -6.294776098179843|   13.973935970171693|        1|winter|      winter|
|   12|   -9.814020929424249|   14.417645506544467|        2|winter|      winter|
|    2|    -5.83057417469433|   14.518216912065954|        3|winter|      winter|
|   11|   -4.177552947793544|    15.99675852985561|        4|autumn|      autumn|
|    3|   -4.845702851176519|   17.019559528027298|        5|spring|      spring|
|   10|  -1.9310375362803789|   19.284097329660977|        6|autumn|      autumn|
|    4|  -2.9175169930454037|    20.41644317842727|        7|spring|      spring|
|    9|  -0.2034015738605942|   23.701991028867827|        8|autumn|      autumn|
|    5|  -0.8002945385715458|   24.354988555758272|        9|spring|      spring|
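|    6|  -0.4465843441516472|    26.21454659678375|       10|summer|      summer|
|    8| -0.00323508493013298|   26.269253851997522|       11|summer|      summer|
|    7|                  0.0|    27.30916489960508|       12|summer|      summer|
+-----+---------------------+---------------------+---------+------+------------+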

The above two methods result in exactly the same output.

Comparing the two methods for deriving seasons, the second clearly requires far less code and is much more efficient. The first would be more appropriate if we wanted to see whether climatic shifts are causing a change in seasonal temperatures. 

The first method does, however, exercise many Spark features (windows, ranking, and UDFs), and thus we include it for illustrative purposes. 

Let's now apply the second method to the working DataFrame:

In [23]:
# We apply the map type to the month column.
working_df = working_df.withColumn("season", mapping_expr.getItem(working_df["month"]))

### Types of missing data 
Missing data causes three main problems:

- It can introduce bias in the dataset.
- It complicates the analysis of the dataset since missing values have to be treated differently.
- It reduces computational efficiency.

Missing data is the bane of any data engineer or data scientist, and Spark makes it slightly more complicated still: it has to reconcile its own built-in types with SQL types and with the types of the various programming languages it supports. 


**Null values in Spark**

When the value of a specific column in a specific row is not known, it is represented as `NULL` in SQL. Because Spark SQL is the base engine on which the rest of Spark is built, this is the base representation of missing values in Spark. A `NULL` can appear in a column of any type, provided the column's `nullable` argument is set to `True`. 

Let's look at the behaviour of `NULL` values in Spark:
- Any comparison with a `NULL` value will return a `NULL` value.

Here's an instance where the 'dewpt' is null:

In [24]:
working_df.where(F.isnull('dewpt')).select('county', 'station', 'dewpt', 'rhum').show(2)

+------+-------+-----+----+
|county|station|dewpt|rhum|
+------+-------+-----+----+
|Galway|ATHENRY| null|  85|
|Galway|ATHENRY| null|  88|
+------+-------+-----+----+
only showing top 2 rows



We do a comparison with 'rhum', writing the output to a new column:

In [25]:
working_df.withColumn('dewpt_rhum', F.col('dewpt') == F.col('rhum')).where(F.isnull('dewpt')).select('county', 'station', 'dewpt', 'rhum', 'dewpt_rhum').show(2)

+------+-------+-----+----+----------+
|county|station|dewpt|rhum|dewpt_rhum|
+------+-------+-----+----+----------+
|Galway|ATHENRY| null|  85|      null|
|Galway|ATHENRY| null|  88|      null|
+------+-------+-----+----+----------+
only showing top 2 rows



As you can see, the output from the comparison is 'null'.

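*Spark also provides a null-safe equality operator (`<=>` in Spark SQL, `eqNullSafe()` on a PySpark column), which never returns `NULL`: it returns `False` if exactly one of the operands is `NULL` and `True` if both are.*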

- In logical operations the following table describes the behaviour of Spark:

    | Left Operand | Right Operand | OR   | AND   |
    |--------------|---------------|------|-------|
    | True         | NULL          | True | NULL  |
    | False        | NULL          | NULL | False |
    | NULL         | True          | True | NULL  |
    | NULL         | False         | NULL | False |
    | NULL         | NULL          | NULL | NULL  |


- Spark supports a set of expressions that are tolerant to null values:
  - COALESCE
  - NULLIF
  - IFNULL
  - NVL
  - NVL2
  - ISNAN
  - NANVL
  - ISNULL
  - ISNOTNULL
  - ATLEASTNNONNULLS
  - IN
  
The behaviour of each will depend on the specific function. For more information, refer to the Spark functions documentation.

- Most other expressions are intolerant of `NULL` values and return `NULL` when any of their arguments is `NULL`, for example:
  - CONCAT
  - POSITIVE
  - TO_DATE
  - TO_TIMESTAMP

- Aggregate functions in Spark ignore `NULL` values, except for `COUNT(*)`, which counts every row.

Aggregates such as `MIN`, `MAX`, `SUM`, and `AVG` return `NULL` only if all of their input values are `NULL`.

- WHERE, HAVING and JOIN clauses follow the same rules where NULL values are excluded, except if specified, for example, SELECT * FROM table WHERE field IS NULL.
- When grouping values, NULL values are grouped into a separate group.
- When ordering data, NULL values are placed first by default in ascending order (and last in descending order).

This is quite a mouthful to get you started on NULLs. The best way to understand is with practice, which we will get to after working with NaNs.
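As a starting point, a few of the rules above can be emulated in plain Python. This is only a sketch of the semantics, with `None` standing in for `NULL`; the helper names are hypothetical and are not Spark APIs.

```python
# None stands in for SQL NULL. These helpers are illustrative, not Spark APIs.

def sql_eq(a, b):
    """Regular equality: the result is NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b

def sql_null_safe_eq(a, b):
    """Null-safe equality: always returns True or False, never NULL."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

def sql_and(a, b):
    """AND: False wins over NULL; otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    """OR: True wins over NULL; otherwise NULL propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

print(sql_eq(None, 85))              # None
print(sql_null_safe_eq(None, 85))    # False
print(sql_null_safe_eq(None, None))  # True
print(sql_and(False, None))          # False
print(sql_or(None, True))            # True
print(sql_or(False, None))           # None
```

The key pattern is that `NULL` means "unknown": a logical operation only returns a definite answer when the known operand already decides the result.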


**NaN values in Spark**

NaN ('Not a Number') is a special value of the `float` and `double` types that represents a result that cannot be expressed as a number. Some of the specific characteristics when working with NaNs in Spark are:

- NaN = NaN returns True.
- When aggregating data, NaN values are grouped.
- NaNs are treated as normal values in joins.
- NaNs are placed last when ordering data, being larger than any other value.
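Note that the first rule differs from plain Python, which follows the IEEE 754 convention that NaN is not equal to anything, including itself. The short sketch below shows the Python side only, as a contrast to keep in mind when moving values between Python and Spark:

```python
import math

nan = float('nan')

# In plain Python (IEEE 754), NaN is not equal to anything, itself included.
print(nan == nan)       # False

# Use math.isnan() to test for NaN in Python.
print(math.isnan(nan))  # True

# NaN is a float value, whereas None (Spark's NULL) is the absence of a value.
print(nan is None)      # False
```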

We have no NaNs here (which should hopefully be the case most of the time for data engineers), but there are many `NULL` values. In the preceding code, we have already run many operations on columns that still contain `NULL` values; the aggregate functions simply skipped them. In the next section, we will impute values to remove the `NULL` values. 

### Imputing data

As mentioned above, sometimes our datasets have a lot of missing data. While it may seem intuitive and simple to just drop all rows which contain one or more missing values (and, indeed, many statistical packages either allow you to or do this by default), there must be more efficient ways of dealing with missing data.

This is where imputation or infilling comes in. Imputation is defined as the process of replacing missing values in a table or dataset with substituted values. These substituted values are normally based on additional data contained within the dataset, such as contextual data. 

Imputation takes numerous forms:

- Hot-deck
- Cold-deck
- Mean substitution
- Non-negative matrix factorization (NNMF)
- Regression
- Last observation
- Stochastic sampling
- K-nearest neighbours imputation
- Multiple imputations
- Deep learning

The specific imputation technique applied will depend on the use case. For example, if you are dealing with time series sensor data you may go with the last observation. If dealing with house prices, a mean substitution may be most appropriate, or a regression. 

Imputation is a whole discipline on its own, and a full treatment of its various techniques is well beyond the scope of this train. To find out more, you can read an overview of the topic [here](https://en.wikipedia.org/wiki/Imputation_(statistics)).

For this dataset, it makes sense to impute a missing temperature with the mean temperature recorded at the same station on the same day of the year, since temperature may not be logged continuously. 

Data from a source system is rarely clean; it contains flaws and gaps that first have to be addressed. Imputation lets us move beyond simply removing bad data towards creating and infilling new data. 

Let's implement this using a window function.

In [26]:
day_window = Window.partitionBy(F.col('day'), F.col('station'))

working_df_imputed = working_df.withColumn('temp', F.when(F.isnull(F.col('temp')), F.mean('temp').over(day_window)).otherwise(F.col('temp')))

Again, there's quite a bit to unpack here. But by now you should be used to the fact that you can make Spark do pretty much anything you want in a single line. 

In this example, we first create a window that is per day, per station. This will allow us to create values that are specific to the station for a given day. 
Next, we create a new column for the temperature (it just so happens that we actually want all the temperature data in one column, so we just recreate `temp`). Within the column, we create a `when()` condition to check if the column is empty. If it is empty, we fill it with the mean, over the window previously defined. If it's not empty, we just fill it with the value already in the temperature column.

To check that we are not lying to you, you can run a few tests to validate that this is the case. (Pro-tip: do not put the values into `temp` but rather into an intermediate column to validate.)
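To make the logic easy to verify by hand, here is a plain-Python sketch of the same idea on a handful of made-up readings. The station names come from the dataset, but the temperatures are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy readings: (station, day, temp). None marks a missing temperature.
readings = [
    ('ATHENRY', 1, 7.0),
    ('ATHENRY', 1, None),
    ('ATHENRY', 1, 9.0),
    ('BELMULLET', 1, 4.0),
    ('BELMULLET', 1, None),
]

# Collect the known temperatures per (station, day) group, like the window partition.
known = defaultdict(list)
for station, day, temp in readings:
    if temp is not None:
        known[(station, day)].append(temp)

# Replace each missing value with its group's mean; keep existing values unchanged.
imputed = [
    (station, day, temp if temp is not None else mean(known[(station, day)]))
    for station, day, temp in readings
]
print(imputed)
```

One difference to be aware of: if a group contains no known values at all, `mean()` raises an error in this sketch, whereas the Spark window mean would be null and the value would stay missing.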

In [27]:
# Let's check if temp is empty.

working_df_imputed.select(F.count(F.when(F.isnull('temp'),'temp')).alias('temp')).collect()

[Row(temp=0)]

Great!

This, however, is a gross simplification of the actual situation, since temperatures will change during the day. A more intuitive and logical imputation may be to impute the temperature using the mean value for temperature for the specific hour over the month within which the reading was taken.

Try to implement that on your own.

For the remaining columns, we will stick with a mean imputation over the season, per station:

In [28]:
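# Window over each season at each station.
season_window = Window.partitionBy(F.col('season'), F.col('station'))

cols_to_impute = [
    'rhum',
    'wdsp',
    'wddir',
    'vis',
    'clht',
    'clamt',
    'rain',
    'wetb',
    'dewpt',
    'vappr',
    'msl',
    'sun'
]

# Build on working_df_imputed so that every imputation (including temp, above) is kept.
# Note: the mean is a double, so imputed integer columns are promoted to double.
for field in cols_to_impute:
    working_df_imputed = working_df_imputed.withColumn(
        field,
        F.when(F.isnull(F.col(field)), F.mean(field).over(season_window)).otherwise(F.col(field))
    )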

In [33]:
# Let's check the failures again: count the values that become null when cast to the expected type.

failures = {}
test_df = working_df_imputed

for col in weather_int:
    before_count = test_df.select(F.count(F.when(F.isnull(col), col)).alias(col)).collect()
    test_df = test_df.withColumn(f'test_type_{col}', F.col(col).cast(IntegerType()))
    after_count = test_df.select(F.count(F.when(F.isnull(f'test_type_{col}'), f'test_type_{col}')).alias(f'test_type_{col}')).collect()
    failures[col] = after_count[0][0] - before_count[0][0]
    
for col in weather_float:
    before_count = test_df.select(F.count(F.when(F.isnull(col), col)).alias(col)).collect()
    test_df = test_df.withColumn(f'test_type_{col}', F.col(col).cast(FloatType()))
    after_count = test_df.select(F.count(F.when(F.isnull(f'test_type_{col}'), f'test_type_{col}')).alias(f'test_type_{col}')).collect()
    failures[col] = after_count[0][0] - before_count[0][0]

In [34]:
failures

{'rhum': 0,
 'wdsp': 0,
 'wddir': 0,
 'vis': 0,
 'clht': 0,
 'clamt': 0,
 'latitude': 0,
 'longitude': 0,
 'rain': 0,
 'temp': 0,
 'wetb': 0,
 'dewpt': 0,
 'vappr': 0,
 'msl': 0,
 'sun': 0,
 'day': 0,
 'month': 0,
 'season': 4660423}

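None of the numeric columns fails the type cast, and with the imputation applied they should no longer contain missing values. Yay! The large count for `season` is expected rather than a problem: `season` is a string column that ends up in `weather_float` only because it is not listed in `weather_cat`, so every one of its values fails the cast to float.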

This allows us to create a dataset that is at least complete. However, for data lineage and documentation it is extremely important to note all ingestion assumptions and imputations that have been performed to ensure that all other teams consuming the data are aware of all changes that were made. 

It is also important to expose a bronze version of the dataset to other teams who may want to perform a different set of imputations or transformations. 

## Indexing and ordering

Sometimes it may be beneficial to order data in such a way that it will logically be structured on a data storage system. This may be done through indexing and partitioning for fast access in applications like Spark. (Remember, Spark always tries to process data that are logically closest to the executors.) Also, when filtering data, Spark is most efficient if it already knows where to look.

The same principle applies when working with relational databases, where indexes allow for data to be searched very efficiently and for relationships to be maintained between separate tables.

While Spark does not inherently allow for indexes on tables (data are rather physically separated into partitions), it is still possible for us to create indexes for tables that will later be written to SQL or another relational database.

Let's index the original DataFrame that we have:

In [35]:
working_df = working_df.orderBy('date', 'station')

working_df = working_df.withColumn('idx', F.monotonically_increasing_id())

Doing the above two operations serves two purposes. First, we order the data by date and station, meaning that all the data points for the first date are together, sorted by station. 

Next, we add an index. The values produced by `monotonically_increasing_id()` are unique and increasing, but not necessarily consecutive. If we write this data to SQL, we can use this column as an index. 
This allows us to partition the data correctly in Spark (the exact details of which are beyond the scope of this train), and also allows fast lookups in SQL.
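The IDs from `monotonically_increasing_id()` are not consecutive because of how they are built. According to the PySpark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The plain-Python sketch below mimics that layout; the function name is hypothetical and the layout is an implementation detail that may change.

```python
def sketch_monotonic_id(partition_id, row_in_partition):
    # Partition ID in the upper 31 bits, record number in the lower 33 bits.
    return (partition_id << 33) + row_in_partition

# Rows in the first partition get small, consecutive IDs ...
print(sketch_monotonic_id(0, 0))  # 0
print(sketch_monotonic_id(0, 2))  # 2

# ... but the first row of the next partition jumps far ahead.
print(sketch_monotonic_id(1, 0))  # 8589934592
```

If a gap-free index is required, `F.row_number()` over an ordered window is one alternative, at the cost of moving all rows through a single partition when the window has no `partitionBy()`.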

In [36]:
working_df.show(10)

+---------+--------------------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+-----+----+-----+---+-----+------+---+
|   county|             station|latitude|longitude|               date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir| sun|  vis|clht|clamt|day|month|season|idx|
+---------+--------------------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+-----+----+-----+---+-----+------+---+
|     Mayo|           BELMULLET|  54.228|  -10.007|1990-01-01 00:00:00| 0.0| 7.6| 7.1|  6.5|  9.7|  93|1003.0|  10|  200| 0.0|26000|  16|    8|  1|    1|winter|  0|
|   Dublin|            CASEMENT|  53.306|   -6.439|1990-01-01 00:00:00| 0.0| 9.2| 8.5|  7.8| 10.5|  91|1007.9|  13|  160| 0.0|15000|  14|    7|  1|    1|winter|  1|
|     Mayo|         CLAREMORRIS|  53.711|   -8.993|1990-01-01 00:00:00| 0.1| 7.1| 7.0|  6.9|  9.9|  99|1004.5|   7|  160|null| null|null| null|  1|    1|winter|  2|
|     Cork

## Anonymisation and encryption

Data is a commodity in the information age. As such, it is our responsibility as data engineers to ensure that the personally identifiable information (PII) we hold about individuals does not reach the hands of malicious agents. 

To ensure this, any PII data should be anonymised before being transferred outside of our systems. Fortunately, encryption is mostly taken care of by modern cloud providers, and the details thereof are beyond the scope of this train. You can read more about [why encryption is important](https://aws.amazon.com/blogs/security/importance-of-encryption-and-how-aws-can-help/) and about how cloud providers such as [AWS](https://docs.aws.amazon.com/whitepapers/latest/introduction-aws-security/data-encryption.html) and [Azure](https://docs.microsoft.com/en-us/azure/security/fundamentals/encryption-overview) can help. Anonymisation can happen at various levels: cell level, field level, or row level. Luckily for us, no fields in the above dataset contain PII data.


If any of the fields did contain PII data, we might have to treat the whole table as sensitive, meaning we would have to restrict access and make sure that when presented or given to analytics teams they have the correct data protection assessment in place. One possible way of dealing with sensitive data includes having a separate data lake or blob store that only deals with sensitive data. Access to this lake will, thus, be restricted and will have directory-based access limitations. Another method for dealing with sensitive data would be to mask or remove the sensitive field. 
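As an illustration of masking a sensitive field, the plain-Python sketch below shows two common options: replacing a value with a keyed hash (pseudonymisation) and hiding most of its characters. The key, the function names, and the sample email address are all hypothetical. In Spark, a comparable unkeyed hash is available as `F.sha2()`.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would come from a secrets manager, not source code.
SECRET_KEY = b'replace-with-a-managed-secret'

def pseudonymise(value: str) -> str:
    """Return a keyed SHA-256 digest, so the raw value is never exposed."""
    return hmac.new(SECRET_KEY, value.encode('utf-8'), hashlib.sha256).hexdigest()

def mask(value: str, visible: int = 2) -> str:
    """Keep the first few characters and replace the rest with asterisks."""
    return value[:visible] + '*' * max(len(value) - visible, 0)

# The same input always maps to the same digest, so joins on the field still work.
print(pseudonymise('jane.doe@example.com'))
print(mask('jane.doe@example.com'))
```

A keyed hash is used here rather than a bare hash because values with a small set of possibilities can otherwise be recovered by hashing every candidate.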



## Modeling, typecasting, formatting, and renaming

The last step. Finally. 

Our users will want to have the data prepared in a specific format, type, and structure. This last step in the data engineering toolkit is all about forming and manipulating the data into a structure and format that is acceptable for our end-users. Let's prepare our data for output onto a PowerBI dashboard. For this, we require the data in a wide format, with fields labelled with the full descriptions, and all field types as either string, float, or integer. 

> **Definitions** 📝
> 
>More on wide and narrow datasets and their uses can be found [here](https://en.wikipedia.org/wiki/Wide_and_narrow_data).

In [37]:
working_df.printSchema()

root
 |-- county: string (nullable = true)
 |-- station: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- rain: float (nullable = true)
 |-- temp: float (nullable = true)
 |-- wetb: float (nullable = true)
 |-- dewpt: float (nullable = true)
 |-- vappr: float (nullable = true)
 |-- rhum: integer (nullable = true)
 |-- msl: float (nullable = true)
 |-- wdsp: integer (nullable = true)
 |-- wddir: integer (nullable = true)
 |-- sun: float (nullable = true)
 |-- vis: integer (nullable = true)
 |-- clht: integer (nullable = true)
 |-- clamt: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- season: string (nullable = true)
 |-- idx: long (nullable = false)



Let's convert the date to a string. It will be parsed back to a timestamp on the dashboard.

In [38]:
working_df = working_df.withColumn('date', F.col('date').cast('STRING')) # We can also use SQL notation in casting.

Let's convert the names to full names:

In [39]:
convert_dict = {
    'county': 'County',
    'station': 'Weather Station', 
    'latitude': 'Latitude',
    'longitude': 'Longitude',
    'date': 'Date',
    'rain': 'Rain (mm)',
    'temp': 'Temperature (°C)',
    'wetb': 'Wet Bulb Air Temperature (°C)',
    'dewpt': 'Dew Point',
    'vappr': 'Vapour Pressure (hPa)',
    'rhum': 'Relative Humidity (%)',
    'msl': 'Mean Sea Level Pressure (hPa)',
    'wdsp': 'Mean Hourly Wind Speed (kt)',
    'wddir': 'Predominant Hourly Wind Direction (degrees)',
    'sun': 'Sun (hours)',
    'vis': 'Visibility (m)',
    'clht': 'Cloud Ceiling Height',
    'clamt': 'Cloud Amount (Oktas)',
    'idx': 'Index',
    'day': 'Day', 
    'month': 'Month',
    'season': 'Season'}

In [40]:
output_df = working_df
for col in working_df.columns:
    output_df = output_df.withColumnRenamed(col, convert_dict[col])
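
Because the loop indexes `convert_dict[col]` directly, a column without an entry in the dictionary would raise a `KeyError` partway through. A minimal sketch of a pre-check follows, in plain Python on lists of names; the helper `check_rename_mapping` and the toy inputs are hypothetical and only illustrate the idea.

```python
from collections import Counter


def check_rename_mapping(columns, mapping):
    """Return (columns with no new name, new names that are used more than once)."""
    missing = [c for c in columns if c not in mapping]
    counts = Counter(mapping[c] for c in columns if c in mapping)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return missing, duplicates


# Toy inputs (made up): 'hour' has no mapping, and two columns share a new name.
toy_columns = ['county', 'temp', 'wetb', 'hour']
toy_mapping = {'county': 'County', 'temp': 'Temperature', 'wetb': 'Temperature'}

missing, duplicates = check_rename_mapping(toy_columns, toy_mapping)
print(missing, duplicates)
```

On the real data, one could call `check_rename_mapping(working_df.columns, convert_dict)` before renaming and proceed only if both lists come back empty.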

In [41]:
output_df.printSchema()

root
 |-- County: string (nullable = true)
 |-- Weather Station: string (nullable = true)
 |-- Latitude: float (nullable = true)
 |-- Longitude: float (nullable = true)
 |-- Date: string (nullable = true)
 |-- Rain (mm): float (nullable = true)
 |-- Temperature (°C): float (nullable = true)
 |-- Wet Bulb Air Temperature (°C): float (nullable = true)
 |-- Dew Point: float (nullable = true)
 |-- Vapour Pressure (hPa): float (nullable = true)
 |-- Relative Humidity (%): integer (nullable = true)
 |-- Mean Sea Level Pressure (hPa): float (nullable = true)
 |-- Mean Hourly Wind Speed (kt): integer (nullable = true)
 |-- Predominant Hourly Wind Direction (degrees): integer (nullable = true)
 |-- Sun (hours): float (nullable = true)
 |-- Visibility (m): integer (nullable = true)
 |-- Cloud Ceiling Height: integer (nullable = true)
 |-- Cloud Amount (Oktas): integer (nullable = true)
 |-- Day: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- Season: string (nullable = true)

That looks great! 
While we do not need this dataset to be in a long format, let's just have a quick look into pivoting.

### Pivoting

Pivoting reshapes a dataset from row-orientated (narrow) to column-orientated (wide): the distinct values of one column become columns in their own right. The inverse, going from column-orientated back to row-orientated (wide to narrow), is known as unpivoting.

For optimal visualisation in a bespoke application, it may be preferable to have the data in a narrow format, whereas for data scientists it may be more appropriate to have the data in a wide format. 
Choosing between the two forms part of the final formatting step of the data engineering process.

Pivoting data may also involve aggregating it into a more condensed format, for example by computing the mean, sum, minimum, or maximum of each group. This is especially true when creating a data warehouse for data analytics.
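
To make the reshaping concrete, here is a small plain-Python sketch of a pivot with a mean aggregation, followed by an unpivot. It is not how Spark implements these operations; the function names and the toy readings are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Toy narrow records (made-up values): one row per (date, county) reading.
narrow_rows = [
    ('2020-01-01', 'Cork', 10.0),
    ('2020-01-01', 'Cork', 12.0),
    ('2020-01-01', 'Mayo', 9.0),
    ('2020-01-02', 'Cork', 8.0),
]


def pivot_mean(rows):
    """Group by the first field, spread the second into columns, average the third."""
    cells = defaultdict(list)
    for key, column, value in rows:
        cells[(key, column)].append(value)
    columns = sorted({column for _, column, _ in rows})
    keys = sorted({key for key, _, _ in rows})
    # Combinations with no readings become None, similar to nulls in a pivoted table.
    return {
        key: {col: mean(cells[(key, col)]) if (key, col) in cells else None
              for col in columns}
        for key in keys
    }


def unpivot(wide):
    """Turn the wide mapping back into narrow (key, column, value) rows, skipping None."""
    return [(key, col, value)
            for key, row in wide.items()
            for col, value in row.items()
            if value is not None]


wide = pivot_mean(narrow_rows)
narrow_again = unpivot(wide)
print(wide)
print(narrow_again)
```

Note that unpivoting the aggregated table does not recover the original four rows: the two Cork readings on the first date have been averaged into one value, so a pivot with aggregation is lossy.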


<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Processing%20Big%20Data/reshaping_pivot.png"
     alt="Pivoting"
     style="float: center; padding-bottom=0.5em"
     />
    <em>Figure 1. Pivoting.</em>
</div>

Let's look at an example from our dataset and try to get the mean temperature for each county in a pivoted table.

In [42]:
# We first use the groupBy() method to group the data, followed by the pivot() method.
# Finally, we call an aggregation function.
# Select a subset of counties to improve the display.

(output_df
    .groupBy('Date')
    .pivot('County')
    .avg('Temperature (°C)')
    .select('Date', 'Carlow', 'Cork', 'Galway', 'Mayo', 'Wexford')
    .show())

+-------------------+------+------------------+------+------------------+-------+
|               Date|Carlow|              Cork|Galway|              Mayo|Wexford|
+-------------------+------+------------------+------+------------------+-------+
|1990-02-05 13:00:00|  null|11.300000190734863|  null|10.600000381469727|   null|
|1990-03-11 12:00:00|  null|              11.5|  null|10.600000381469727|   null|
|1990-05-10 10:00:00|  null|11.599999904632568|  null|12.050000190734863|   null|
|1990-06-28 19:00:00|  null|12.950000286102295|  null|13.150000095367432|   null|
|1990-07-04 00:00:00|  null|13.050000190734863|  null|11.449999809265137|   null|
|1990-12-05 03:00:00|  null| 6.599999904632568|  null| 6.599999904632568|   null|
|1992-03-08 21:00:00|  null| 7.099999904632568|  null| 7.950000286102295|   null|
|1992-08-30 03:00:00|  null|10.199999809265137|  null|10.050000190734863|   null|
|1992-11-22 12:00:00|  null|              12.5|  null|12.150000095367432|   null|
|1993-04-13 18:0

That's a wrap!

## Conclusion

In this train, we looked at the second half of a data engineering task from a processing perspective. We re-ingested the dataset, learned how to deal with incomplete data, and looked at preparing output for other teams.

We covered many of the data cleaning and transformation steps that data engineering requires and showed how they are implemented in Apache Spark. To stay relevant and maintain fluency, we recommend visiting the [Apache Spark documentation](https://spark.apache.org/docs/latest/) regularly for new functionality and feature releases. 
