# Data Engineering in Spark – I
© Explore Data Science Academy


<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Processing%20Big%20Data/spark.png"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     />

</div>

## Learning objectives

In this train, you'll learn how to implement common data engineering transformations. These include: 

  - extraction and parsing; 
  - translation and mapping; and
  - filtering, aggregation, and summarisation.

## The data engineering process

Data engineering, at its core, is the foundational practice of preparing an environment in which we can build big data systems. While a large part of the data engineering process is concerned with building out the infrastructure to ingest data and creating pipelines that will transform the data in an automated way, one of the largest components in data engineering is the transformation and cleaning of datasets. 

In this train, we will take a single dataset through the typical data engineering process, starting with ingesting the data, transforming it, and outputting the data once done. We will touch on several common transformations that you need in your data engineering toolkit. 

Data transformation is the process of changing the structure, format, and content of a dataset. A transformation may serve one of several goals:
- **Constructive**: adding, copying, replicating, and creating new fields.
- **Destructive**: deleting fields or records, and removing erroneous fields.
- **Aesthetic**: standardisation, typecasting, and pivoting.
- **Structural**: renaming, moving, joining, and combining fields.
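
To make the four categories concrete, here is a minimal plain-Python sketch that applies one transformation of each kind to a single toy record. The record and the derived field names (`temp_f`, `temp_c`) are hypothetical and only serve as an illustration; they are not part of the dataset used later.

```python
# A toy record with every value still stored as a string (hypothetical data).
record = {"station": "belmullet", "temp": "7.6", "rain": "0.0", "notes": ""}

# Constructive: create a new field from an existing one (Celsius to Fahrenheit).
record["temp_f"] = float(record["temp"]) * 9 / 5 + 32

# Destructive: delete a field that carries no information.
del record["notes"]

# Aesthetic: standardise the casing and typecast the numeric strings.
record["station"] = record["station"].upper()
record["temp"] = float(record["temp"])
record["rain"] = float(record["rain"])

# Structural: rename a field so that the unit is part of its name.
record["temp_c"] = record.pop("temp")

print(record)
```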

Transforming the data improves its quality and reliability and reduces the chance of errors when working with the datasets, which lets data scientists and analysts concentrate on domain complexity. 

Keep in mind that transformations are computationally expensive, which translates into real monetary cost. A high bill from your cloud provider will not make your system administrator or your boss happy.
Carelessness or a lack of expertise can also paint you into a corner, so plan your transformations carefully to avoid breaking current business processes or making life difficult for other data team members. 


## Extraction and parsing

Extraction is the process of ingesting the data from a source system (in this case, just a CSV from Kaggle).
Parsing, on the other hand, is transforming the data into an appropriate structure to allow compatibility with destination systems. This means changing data into the correct types for your destination and finally writing to a destination system. 



In [1]:
# Import Spark and some auxiliary functions and set up a SparkSession.

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

In this train, we will be using a dataset that can be [downloaded from Kaggle](https://www.kaggle.com/conorrot/irish-weather-hourly-data).

This dataset contains hourly weather data from 25 weather stations across Ireland, obtained from the Irish Meteorological Service, Met Éireann.

In [2]:
# Read the CSV using the SparkSession's DataFrameReader, treating the first row as the header.
# No schema is supplied, so every column is read as a string.
# Change the below location to match your current working directory.

raw_df = spark.read.csv('./data/weather_data/hrly_Irish_weather.csv', header=True)

In [3]:
raw_df.printSchema()

root
 |-- county: string (nullable = true)
 |-- station: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- date: string (nullable = true)
 |-- rain: string (nullable = true)
 |-- temp: string (nullable = true)
 |-- wetb: string (nullable = true)
 |-- dewpt: string (nullable = true)
 |-- vappr: string (nullable = true)
 |-- rhum: string (nullable = true)
 |-- msl: string (nullable = true)
 |-- wdsp: string (nullable = true)
 |-- wddir: string (nullable = true)
 |-- sun: string (nullable = true)
 |-- vis: string (nullable = true)
 |-- clht: string (nullable = true)
 |-- clamt: string (nullable = true)



The `printSchema()` method gives you a breakdown of the schema of the incoming data, detailing the field name, its type, and whether it is nullable or not.

From the data ingestion, we can see that there are 18 fields, all of which have been read as strings because we did not supply a schema. Let's look at the data and decide on better types for each field.

We'll only look at complete records, using the `dropna()` method on the DataFrame to drop every row that contains a null value.
Remember that methods called on a Spark DataFrame do not change the DataFrame itself, but rather return a new DataFrame, on which we call the `show()` method to display the result.


In [4]:
raw_df.dropna().show(10)

+------+---------+--------+---------+-----------------+----+----+----+-----+-----+----+------+----+-----+---+-----+----+-----+
|county|  station|latitude|longitude|             date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir|sun|  vis|clht|clamt|
+------+---------+--------+---------+-----------------+----+----+----+-----+-----+----+------+----+-----+---+-----+----+-----+
|  Mayo|BELMULLET|  54.228|  -10.007|01-jan-1990 00:00| 0.0| 7.6| 7.1|  6.5|  9.7|  93|1003.0|  10|  200|0.0|26000|  16|    8|
|  Mayo|BELMULLET|  54.228|  -10.007|01-jan-1990 01:00| 0.0| 7.6| 7.2|  6.7|  9.8|  94|1003.0|  10|  190|0.0|19000|  18|    8|
|  Mayo|BELMULLET|  54.228|  -10.007|01-jan-1990 02:00| 0.0| 7.8| 7.2|  6.5|  9.7|  91|1003.1|   8|  200|0.0|22000|  18|    7|
|  Mayo|BELMULLET|  54.228|  -10.007|01-jan-1990 03:00| 0.0| 7.7| 6.8|  5.7|  9.2|  87|1003.4|  12|  220|0.0|26000|  18|    6|
|  Mayo|BELMULLET|  54.228|  -10.007|01-jan-1990 04:00| 0.5| 7.5| 6.5|  5.3|  8.9|  86|1003.2|  13|  210|0.0|30

This does give us a good idea, but it's not good enough, yet. Let's use the `describe()` method to get summary statistics of the DataFrame. 

> 💡 &nbsp; **Did you know?**
>
> If the DataFrame has too many columns to display using the built-in Spark viewer, use the `toPandas()` method to view the DataFrame using the Pandas interface. Note, however, that you should limit the size of the DataFrame in such a case so as not to pull a massive DataFrame into memory and crash your computer.

In [5]:
raw_df.describe().toPandas()

| |summary|county|station|latitude|longitude|date|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|count|4660423|4660423|4660423.0|4660423.0|4660423|4660423.0|4660423.0|4660423.0|4660423.0|4660423.0|4660423.0|4660423.0|4431391.0|4431391.0|2075256.0|2075256.0|2075256.0|2075256.0|
|1|mean| | |53.25453260320943|-8.181231621900034| |0.1240679104054045|9.967762231294808|8.67506133623306|7.229586436130947|10.578773017025611|83.67801382669663|1013.255110896992|9.672500479374923|199.95195723173796|0.1565353468959093|26897.97343722213|264.2836716962836|5.822756711669644|
|2|stddev| | |0.9898850408372968|1.2206812648919267| |0.4845203822045314|4.711385132475347|4.31289799183682|4.473829797589463|3.1552922588702934|11.845824244118235|12.593850086665055|6.204022538237791|90.894298135227|0.3281719844755069|15665.079051528628|400.4829149145605|2.3707809390027625|
|3|min|Carlow|ATHENRY|51.476000000000006|-10.007|01-apr-1990 00:00| | | | | | | | | | | | | |
|4|max|Wexford|VALENTIA OBSERVATORY|55.372|-9.901|31-oct-2019 23:00|9.9|9.9|9.9|9.9|9.9|99.0|999.9|97.0|90.0|4.9|9000.0|999.0|9.0|


From that, let's define the schema.

Many of the fields are numerical values. Spark has **Integer** types, **Decimal** types, and **Float** types, along with various others, which we have dealt with elsewhere. 

When reading the file, we are not going to parse the date as a **timestamp** or **DateTime** just yet. As of the writing of this train, Spark's DataFrame reader does not parse **DateTime** or **timestamp** values very reliably. It is therefore better to read the field as a string and parse it in a separate step, where we can check that the coercion was done correctly. 

In [6]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, FloatType, TimestampType

weather_schema = StructType([
    StructField('county', StringType()),
    StructField('station', StringType()),
    StructField('latitude', FloatType()),
    StructField('longitude', FloatType()),
    StructField('date', StringType()), # While this is a date, it will likely not coerce correctly. Let's parse manually.
    StructField('rain', FloatType()),
    StructField('temp', FloatType()),
    StructField('wetb', FloatType()),
    StructField('dewpt', FloatType()),
    StructField('vappr', FloatType()),
    StructField('rhum', IntegerType()),
    StructField('msl', FloatType()),
    StructField('wdsp', IntegerType()),
    StructField('wddir', IntegerType()),
    StructField('sun', FloatType()),
    StructField('vis', IntegerType()),
    StructField('clht', IntegerType()),
    StructField('clamt', IntegerType())
])

In [7]:
# Ensure that the below path matches your current working directory.
typed_df = spark.read.csv('./data/weather_data/hrly_Irish_weather.csv', header=True, schema=weather_schema)

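Let's parse the date column using the `to_timestamp()` function, which takes the field to convert as the first argument and the format as the second. The format describes how the source strings are laid out (here: day, abbreviated month name, year, hours, and minutes); refer to [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) for date conventions. 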

This will allow us to use all kinds of fancy methods on the `date` column. 
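
As a quick sanity check outside Spark, the same conversion can be sketched in plain Python. This is only an illustration: `datetime.strptime` uses `%`-style directives rather than Spark's pattern letters, and the sample value is copied from the `date` column shown earlier.

```python
from datetime import datetime

# One raw value exactly as it appears in the `date` column of the CSV.
raw_date = "01-jan-1990 00:00"

# %d = day of month, %b = abbreviated month name, %Y = four-digit year,
# %H:%M = hours and minutes on a 24-hour clock.
parsed = datetime.strptime(raw_date, "%d-%b-%Y %H:%M")

print(parsed.isoformat())
```

Checking a handful of raw values this way is a cheap way to confirm that the format string describes the data before applying it to millions of rows.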

In [8]:
typed_df = typed_df.withColumn('date', F.to_timestamp(F.col('date'), format="dd-MMM-yyyy HH:mm"))

In [9]:
typed_df.printSchema()

root
 |-- county: string (nullable = true)
 |-- station: string (nullable = true)
 |-- latitude: float (nullable = true)
 |-- longitude: float (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- rain: float (nullable = true)
 |-- temp: float (nullable = true)
 |-- wetb: float (nullable = true)
 |-- dewpt: float (nullable = true)
 |-- vappr: float (nullable = true)
 |-- rhum: integer (nullable = true)
 |-- msl: float (nullable = true)
 |-- wdsp: integer (nullable = true)
 |-- wddir: integer (nullable = true)
 |-- sun: float (nullable = true)
 |-- vis: integer (nullable = true)
 |-- clht: integer (nullable = true)
 |-- clamt: integer (nullable = true)



In [10]:
typed_df.show(10)

+------+-------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|county|station|latitude|longitude|               date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir| sun| vis|clht|clamt|
+------+-------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|Galway|ATHENRY|  53.289|   -8.786|2011-06-26 01:00:00| 0.0|15.3|14.5| 13.9| 15.8|  90|1016.0|   8|  190|null|null|null| null|
|Galway|ATHENRY|  53.289|   -8.786|2011-06-26 02:00:00| 0.0|14.7|13.7| 12.9| 14.9|  89|1015.8|   7|  190|null|null|null| null|
|Galway|ATHENRY|  53.289|   -8.786|2011-06-26 03:00:00| 0.0|14.3|13.4| 12.6| 14.6|  89|1015.5|   6|  190|null|null|null| null|
|Galway|ATHENRY|  53.289|   -8.786|2011-06-26 04:00:00| 0.0|14.4|13.6| 12.8| 14.8|  90|1015.3|   7|  180|null|null|null| null|
|Galway|ATHENRY|  53.289|   -8.786|2011-06-26 05:00:00| 0.0|14.4|13.5| 12.7| 14.7|  89|1015.1|   6|  190|null|n

In [11]:
typed_df.describe().toPandas()

| |summary|county|station|latitude|longitude|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|0|count|4660423|4660423|4660423.0|4660423.0|4548758.0|4627842.0|4615135.0|4616264.0|4472939.0|4496808.0|4587140.0|4349414.0|4334055.0|1843896.0|1805007.0|1843878.0|1843878.0|
|1|mean| | |53.25453252207271|-8.18123172288137|0.1240679109667971|9.967762233979489|8.675061338530968|7.229586438822406|10.578773020413784|83.67801382669663|1013.2551108660024|9.672500479374923|199.95195723173796|0.1565353473330961|26897.97343722213|264.2836716962836|5.822756711669644|
|2|stddev| | |0.9898850775903212|1.2206812705571315|0.4845203822875591|4.71138513295609|4.312897992278999|4.473829799427647|3.155292257692218|11.845824244118235|12.593850056415128|6.204022538237791|90.894298135227|0.3281719847264452|15665.079051528628|400.4829149145605|2.3707809390027625|
|3|min|Carlow|ATHENRY|51.476|-10.241|0.0|-17.3|-99.9|-92.4|0.0|-14.0|943.2|0.0|0.0|0.0|5.0|0.0|0.0|
|4|max|Wexford|VALENTIA OBSERVATORY|55.372|-6.241|41.4|31.5|24.9|23.8|29.5|100.0|1051.2|97.0|360.0|4.9|75000.0|999.0|9.0|


Finally, let's output this to a destination. In this notebook, we will write the data as `parquet` files in the same directory. 

This has various advantages, the first of which is that it allows us to read it in again while maintaining the fields, field names, and other metadata.

In [13]:
# Ensure that the below path matches your current working directory.
typed_df.write.parquet('./data/weather_data/hrly_Irish_weather')

## Translation and mapping

Translation is the process of making the data more human-understandable, for example, changing work order codes in a field into human-readable descriptions of the work done. This process is important to ensure the data is understandable for a customer-facing product, but can also assist data scientists and analysts in the process of creating intelligence and analytics. 

Mapping is a similar process whereby we create a map between the source system and destination system, typically at the field level, for example, changing column names to conform to the destination system. This is especially relevant between different data systems (for example, JSON or XML to a relational database), where different constraints may exist on the field names. 

Mapping of datasets can be driven by metadata, which describes the data fields and their attributes, along with the rules that govern how the data is stored within the database or data repository. 

For this particular dataset, the field abbreviations used above stand for:

- `rain` - Rain (mm) 
- `temp` - Temperature (°C) 
- `wetb` - Wet Bulb Air Temperature (°C) 
- `dewpt` - Dew Point
- `vappr` - Vapour Pressure (hPa) 
- `rhum` - Relative Humidity (%) 
- `msl` - Mean Sea Level Pressure (hPa) 
- `wdsp` - Mean Hourly Wind Speed (kt)
- `wddir` - Predominant Hourly Wind Direction (degrees)
- `sun` - Sun (hours) 
- `vis` - Visibility (m)
- `clht` - Cloud Ceiling Height 
- `clamt` - Cloud Amount (Oktas)

This will come in very handy when we perform data integrity checks. It may also help us confirm that we have done the typecasting correctly. We got this specifically from the metadata file that was included alongside the dataset.
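
A field-level mapping of this kind can be sketched as a simple dictionary lookup. The destination names below (`rain_mm`, `temperature_c`, and so on) are hypothetical and chosen only for illustration; a real destination system would dictate its own naming rules.

```python
# Hypothetical mapping from source abbreviations to descriptive destination names.
FIELD_MAP = {
    "rain": "rain_mm",
    "temp": "temperature_c",
    "wetb": "wet_bulb_temperature_c",
    "rhum": "relative_humidity_pct",
    "msl": "mean_sea_level_pressure_hpa",
}


def map_fields(record, field_map):
    """Rename the keys of one record, leaving unmapped keys unchanged."""
    return {field_map.get(key, key): value for key, value in record.items()}


source_row = {"station": "BELMULLET", "rain": 0.0, "temp": 7.6, "rhum": 93}
mapped_row = map_fields(source_row, FIELD_MAP)
print(mapped_row)
```

In Spark, the same dictionary could drive a loop of `withColumnRenamed()` calls on the DataFrame.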


## Filtering, aggregation, and summarisation

Part of the process of making data more manageable and consistent is filtering, aggregating, and summarising the data. 
This can be done by filtering out `null` values, invalid records, or irrelevant fields. A large part of this phase is the exploration of the dataset for such errors and challenges, a process that is analogous to a data scientist performing an [Exploratory Data Analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis).

It can also be done by aggregating data to a level that is more appropriate for analytical teams, or which will simplify the visualisation of datasets. Examples include aggregating data up to a higher geographical level (visualising COVID-19 cases per country rather than per city) or aggregating time-series data up to the hour instead of the second. 
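
As a small illustration of aggregating a time series to a coarser level, the plain-Python sketch below rolls hourly temperatures up to a daily mean. The five readings are the BELMULLET values from the sample rows displayed earlier; choosing the mean as the aggregate is an assumption made for this example.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

# Hourly (timestamp, temp) readings for BELMULLET from the sample output above.
hourly = [
    ("01-jan-1990 00:00", 7.6),
    ("01-jan-1990 01:00", 7.6),
    ("01-jan-1990 02:00", 7.8),
    ("01-jan-1990 03:00", 7.7),
    ("01-jan-1990 04:00", 7.5),
]

# Group the readings by calendar day.
by_day = defaultdict(list)
for stamp, temp in hourly:
    day = datetime.strptime(stamp, "%d-%b-%Y %H:%M").date()
    by_day[day].append(temp)

# Summarise each day with the mean temperature.
daily_mean = {day: mean(temps) for day, temps in by_day.items()}
print(round(daily_mean[date(1990, 1, 1)], 2))
```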

There is an almost inexhaustible list of errors that can be present within your dataset when performing these tasks. We will try to cover some of the most prominent issues that you may encounter:

- duplicated data;
- spurious observations;
- structural errors; and
- missing observations.

We will deal with the first three in this section. We will also take a first look at the fourth, but only deal with it properly in the next section.

Now that we have the data in parquet format, we can easily read it back in without having to specify the schema again.

In [12]:
# Read the data in using the read.parquet() method from existing parquet files.

working_df = spark.read.parquet('./data/weather_data/hrly_Irish_weather/')

Make a list of categorical and continuous variables. This may be useful if we want to filter by these variable fields later.


In [13]:
weather_cat = ['county', 'station', 'date']
weather_int = ['rhum', 'wdsp', 'wddir', 'vis', 'clht', 'clamt']

# Create a list of continuous fields (fields that are not in the categorical list above).
weather_cont = [x for x in working_df.columns if x not in weather_cat]

# Create a list of float fields (those in the above list, but not in the integer list).
weather_float = [x for x in weather_cont if x not in weather_int] 

### 1. Duplicated data

A feature of uncleansed data may sometimes be the occurrence of duplicated data points. Removing these redundant pieces of data forms part of the filtering process.

The origin of the duplicates may be the result of an internal process (true duplicates) or due to a failure in the system, such as incorrect data capture or incorrect data processing. 

To combat such duplications, it is always important to check and remove duplicate entries within a dataset, and this is done through a deduplication process. 

Deduplication can be simple and is normally based on numerical entries or entries that are well indexed. An example of this might be transactional data in a point-of-sale system, which has a primary key but was processed incorrectly. 

Table 1. An example of duplicated records.

|Index|Date|Name|Value|
|--|--|--|--|
|1|2021-01-01|Ben|20|
|2|2021-01-01|Sarah|12|
|3|2021-01-01|Jack|24|
|3|2021-01-01|Jack|24|
|4|2021-01-01|Ben|20|

In the above table, we demonstrate the types of duplications that can occur. Rows with index 3 indicate two duplicates that likely resulted from a system failure, causing the entry to be duplicated within our system. We should remove this type of duplicate from our system. Rows with index 1 and 4 are true duplicates according to the above definition and are likely a manifestation of the underlying process. Ben probably went to the shop twice on the 1st of January 2021 and bought goods to the value of 20. We would want to keep these duplicates in our system as they likely represent real processes.
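
The distinction can be sketched in plain Python using the rows of Table 1. This is an illustration only, assuming that the `Index` column acts as the primary key: deduplicating on the key removes the repeated row 3 but keeps both of Ben's purchases, because they carry different keys.

```python
# The rows of Table 1 as (index, date, name, value) tuples.
rows = [
    (1, "2021-01-01", "Ben", 20),
    (2, "2021-01-01", "Sarah", 12),
    (3, "2021-01-01", "Jack", 24),
    (3, "2021-01-01", "Jack", 24),
    (4, "2021-01-01", "Ben", 20),
]

# Keep only the first row seen for each primary key.
seen_keys = set()
deduplicated = []
for row in rows:
    key = row[0]
    if key not in seen_keys:
        seen_keys.add(key)
        deduplicated.append(row)

for row in deduplicated:
    print(row)
```

Deduplicating on every column except the key would instead also discard Ben's second purchase, which is exactly the kind of true duplicate we want to keep.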

Alternatively, deduplication can be very complex, with the only unique identifier being a string column, in which case, fuzzy logic or some similar technique has to be applied to identify and remove duplicates. An example of this might be names that are in an HR list, some of which have been entered incorrectly, or consolidation between names in multiple systems. 
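
One simple form of fuzzy matching is a string similarity score. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the names and the threshold of 0.9 are hypothetical, and a production system would likely need a more careful approach than this.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical HR list in which one name was captured with a typo.
names = ["Jonathan Smith", "Jonathon Smith", "Sarah Connor"]
THRESHOLD = 0.9  # assumed cut-off; tune this for real data

# Compare every pair of names once and keep the likely duplicates.
likely_duplicates = [
    (first, second)
    for i, first in enumerate(names)
    for second in names[i + 1:]
    if similarity(first, second) >= THRESHOLD
]
print(likely_duplicates)
```

Pairs flagged this way are candidates for review rather than certain duplicates, since two different people can have very similar names.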

Let's start by checking if all entries within the datasets are unique.


In [14]:
working_df.groupby(working_df.columns) \
  .count() \
  .where('count > 1') \
  .sort('count', ascending=False) \
  .show()

+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+-----+
|county|station|latitude|longitude|date|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|count|
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+-----+
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+-----+



Here we are using the `groupby()` method with all the columns as the argument, meaning we are grouping by all the columns. We then use the `count()` aggregation method and filter the result with the `where()` method, retrieving only rows where the count is larger than one. Finally, we sort by the count column using the `sort()` method and display the resulting DataFrame.

That looks about right... no duplicates across the whole dataset!

That is hardly surprising, though: with 18 fields per record, two rows that match on every single value are statistically unlikely. A stricter test is to treat the date and the station together as a composite key. That should still be unique across the dataset, since we expect no more than one entry per station at any one point in time.

In [15]:
working_df.groupby(['date', 'station']) \
  .count() \
  .where('count > 1') \
  .sort('count', ascending=False) \
  .show()

+----+-------+-----+
|date|station|count|
+----+-------+-----+
+----+-------+-----+



Here we are doing exactly the same as above, except we're only grouping by the `date` and `station` columns.
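
The group, count, and filter pattern is not specific to Spark. As a rough plain-Python analogue, the sketch below counts occurrences of each `(date, station)` key with `collections.Counter`. The observations are made up for illustration, and the last one is a deliberately injected duplicate.

```python
from collections import Counter

# Hypothetical (date, station) keys; the last entry repeats an earlier one.
observations = [
    ("2011-06-26 01:00", "ATHENRY"),
    ("2011-06-26 02:00", "ATHENRY"),
    ("2011-06-26 01:00", "BELMULLET"),
    ("2011-06-26 02:00", "ATHENRY"),
]

# Group by the composite key and count the rows in each group.
key_counts = Counter(observations)

# Keep only keys that occur more than once, largest count first.
duplicate_keys = sorted(
    ((key, count) for key, count in key_counts.items() if count > 1),
    key=lambda item: item[1],
    reverse=True,
)
print(duplicate_keys)
```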

No duplicates... nice!

### 2. Spurious observations

Another feature of a dataset that has not yet gone through data cleansing may be the occurrence of spurious observations. 

Spurious observations may have many origins: humans performing data capture may enter incorrect data, an incorrect understanding of the input data may lead to incorrect entries (for example, 0 being entered as O), or automated data capture processes may lead to incorrect entries in a similar way (for example, an OCR system may read an S in as a 5). Other errors may be even more domain-specific (for example, an acoustic sewer level sensor may give negative readings when the water level is above the sensor, or when covered by a large object). It is clear that we have to make sure that all the values within the input data make sense within the bounded context of the domain we are working in. 

For our dataset, we have defined the field identities above. Let's revisit them and define possible bounds for them:

- `rain` – Rain (mm) – any non-negative decimal number.
- `temp` – Temperature (°C) – any decimal number above -273.15 (expect to be between -20 and 40).
- `wetb` – Wet Bulb Air Temperature (°C) – any decimal number above -273.15 (expect to be between -20 and 40).
- `dewpt` – Dew Point – any decimal number above -273.15 (expect to be between -20 and 40).
- `vappr` – Vapour Pressure (hPa) – any positive decimal number.
- `rhum` – Relative Humidity (%) – between 0 and 100.
- `msl` – Mean Sea Level Pressure (hPa) – any positive decimal number (expect close to 1,013.25).
- `wdsp` – Mean Hourly Wind Speed (kt) – any positive decimal number (maximum at 220 - fastest ever recorded).
- `wddir` – Predominant Hourly Wind Direction (degrees) – between 0 and 360.
- `sun` – Sun (hours) – between 0 and 24.
- `vis` – Visibility (m) – any positive decimal number.
- `clht` – Cloud Ceiling Height – between 0 and 999.
- `clamt` – Cloud Amount (Oktas) – between 0 and 9.
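
Bounds like these can be collected in one place and applied to each record in turn. The plain-Python sketch below is one possible way to do that, not the approach used in the rest of this train. It uses the expected ranges from the list, treats a value of zero as valid, and skips missing values; the sample record copies values from one CASEMENT observation in the dataset.

```python
# Expected (low, high) range per field; None means no limit on that side.
BOUNDS = {
    "rain": (0, None),
    "temp": (-20, 40),
    "wetb": (-20, 40),
    "dewpt": (-20, 40),
    "rhum": (0, 100),
    "wddir": (0, 360),
    "sun": (0, 24),
    "clht": (0, 999),
    "clamt": (0, 9),
}


def out_of_bounds(record, bounds=BOUNDS):
    """Return the names of fields whose non-null value falls outside its range."""
    flagged = []
    for field, (low, high) in bounds.items():
        value = record.get(field)
        if value is None:
            continue  # missing observations are handled separately
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            flagged.append(field)
    return flagged


row = {"rain": 0.0, "temp": 12.5, "wetb": -49.0, "dewpt": 11.1,
       "rhum": 92, "wddir": 210, "sun": 0.0, "clht": 70, "clamt": 7}
flagged_fields = out_of_bounds(row)
print(flagged_fields)
```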

Let's quickly run through checks for all of the above:

#### 1. `rain` >= 0

Rain cannot be negative. Let's see if we have any negative records:

In [16]:
working_df.where(F.col('rain') < 0).show()

+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
|county|station|latitude|longitude|date|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+



#### 2. `temp` between -20 and 40

In [17]:
working_df.where(~F.col('temp').between(-20, 40)).show()

+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
|county|station|latitude|longitude|date|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
+------+-------+--------+---------+----+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+



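But wait, weren't we filtering for records where the temp is between -20 and 40? Why are we not getting anything?

This is where the `~` operator comes in. It negates the condition, so the filter returns everything *except* the records where `temp` is between -20 and 40. An empty result therefore means that no recorded temperature falls outside our expected range. Note that records with a null `temp` are not returned either, because a comparison against null evaluates to null rather than true.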

#### 3. `wetb` between -20 and 40


In [18]:
working_df.where(~F.col('wetb').between(-20, 40)).show(10)

+------+--------+--------+---------+-------------------+----+----+-----+-----+-----+----+------+----+-----+---+-----+----+-----+
|county| station|latitude|longitude|               date|rain|temp| wetb|dewpt|vappr|rhum|   msl|wdsp|wddir|sun|  vis|clht|clamt|
+------+--------+--------+---------+-------------------+----+----+-----+-----+-----+----+------+----+-----+---+-----+----+-----+
|Dublin|CASEMENT|  53.306|   -6.439|2017-08-01 00:00:00| 0.0|12.5|-49.0| 11.1| 13.3|  92|1009.2|   6|  210|0.0|20000|  70|    7|
|Dublin|CASEMENT|  53.306|   -6.439|2017-08-31 21:00:00| 0.0| 8.3|-49.0|  7.0| 10.0|  92|1021.4|   4|  210|0.0|30000| 999|    2|
|Dublin|CASEMENT|  53.306|   -6.439|2017-09-01 00:00:00| 0.0| 8.0|-49.0|  6.8|  9.8|  92|1022.5|   4|  220|0.0|20000| 999|    1|
|Dublin|CASEMENT|  53.306|   -6.439|2018-01-28 10:00:00| 0.0|11.7|-49.0| 10.3| 12.6|  91|1025.5|  20|  240|0.0| 9000|  11|    8|
|Dublin|CASEMENT|  53.306|   -6.439|2019-06-30 08:00:00| 0.0|16.1|-49.0| 12.6| 14.7|  80|1012.5| 

**Weird**... let's have a look at how many of these we have and how extreme they are:

In [19]:
working_df.where(~F.col('wetb').between(-20, 40)).groupBy('wetb', 'station').count().show()

+-----+---------------+-----+
| wetb|        station|count|
+-----+---------------+-----+
|-49.0|   ROCHES POINT|   99|
|-49.0|   CORK AIRPORT|   92|
|-99.9|    CLAREMORRIS|    9|
|-49.0|       CASEMENT|   86|
|-49.0|SHANNON AIRPORT|  103|
|-49.9| DUBLIN AIRPORT|    1|
+-----+---------------+-----+



That is quite a lot. It is also only three very specific values spread across many stations. Let's look at where they are located to see if we can find a pattern:


<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Processing%20Big%20Data/ireland-weather.png"
     alt="Ireland"
     style="float: center; padding-bottom=0.5em"
     width=600px/>
    <p>Figure 1: A map of Ireland.</p>
</div>

They are neither clustered together nor located far north, so an unusual local climate is an unlikely explanation. We suspect that these are default values that the loggers return when they are faulty.

We can note this and hand it off to the modelling teams, or remove these values.

#### 4. `dewpt` between -20 and 40

In [20]:
working_df.where(~F.col('dewpt').between(-20, 40)).show(10)

+------+-----------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|county|    station|latitude|longitude|               date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir| sun| vis|clht|clamt|
+------+-----------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|  Mayo|CLAREMORRIS|  53.711|   -8.993|2003-01-30 18:00:00| 0.0| 0.4|-4.0|-22.3|  1.0|  16|1030.6|   8|   10|null|null|null| null|
|  Mayo|CLAREMORRIS|  53.711|   -8.993|2003-01-30 19:00:00| 0.0|-0.2|-4.5|-22.9|  0.9|  16|1030.9|   5|  310|null|null|null| null|
|  Mayo|CLAREMORRIS|  53.711|   -8.993|2003-01-30 20:00:00| 0.0| 0.4|-4.0|-22.3|  1.0|  16|1031.3|   6|  320|null|null|null| null|
|  Mayo|CLAREMORRIS|  53.711|   -8.993|2003-01-30 21:00:00| 0.0|-0.2|-4.5|-22.8|  0.9|  16|1031.7|   5|  310|null|null|null| null|
|  Mayo|CLAREMORRIS|  53.711|   -8.993|2003-01-30 22:00:00| 0.0| 0.1|-4.2|-22.1|  1

In [21]:
working_df.where(~F.col('dewpt').between(-20, 40)).groupBy('dewpt', 'station').count().show()

+-----+-----------+-----+
|dewpt|    station|count|
+-----+-----------+-----+
|-21.9|CLAREMORRIS|    4|
|-22.9|CLAREMORRIS|    1|
|-21.7|CLAREMORRIS|    2|
|-22.1|CLAREMORRIS|    1|
|-21.8|CLAREMORRIS|    2|
|-22.8|CLAREMORRIS|    1|
|-22.7|CLAREMORRIS|    1|
|-22.2|CLAREMORRIS|    1|
|-22.3|CLAREMORRIS|    2|
|-21.3|CLAREMORRIS|    1|
|-92.4| MOORE PARK|   15|
+-----+-----------+-----+



We could have been wrong in our assumption of no readings lower than -20 since the readings at CLAREMORRIS seem to be legitimate readings. However, the MOORE PARK reading seems too low, and as above with `wetb`, we should remove these values.

Those were all checks of a single field against fixed bounds, for example, checking if temperatures fell within a specific range.

Let's check for dependencies between columns. 
We would expect the dew point to be lower than the current temperature. Let's see if that is the case:

In [22]:
working_df.where(F.col('dewpt') > F.col('temp')).show(10)

+------+----------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|county|   station|latitude|longitude|               date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir| sun| vis|clht|clamt|
+------+----------+--------+---------+-------------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|Galway|   ATHENRY|  53.289|   -8.786|2012-12-10 06:00:00| 0.0|-3.0|-2.9| -2.9|  4.9| 100|1025.9|   2|  140|null|null|null| null|
|Galway|   ATHENRY|  53.289|   -8.786|2013-01-09 09:00:00| 0.0|-1.4|-1.3| -1.3|  5.5| 100|1022.9|   3|   90|null|null|null| null|
|Galway|   ATHENRY|  53.289|   -8.786|2013-01-15 08:00:00| 0.0|-1.7|-1.6| -1.6|  5.4| 100|1014.7|   1|  200|null|null|null| null|
|Galway|   ATHENRY|  53.289|   -8.786|2013-03-17 06:00:00| 0.0|-2.9|-2.8| -2.8|  5.0| 100| 991.4|   1|   40|null|null|null| null|
|Galway|   ATHENRY|  53.289|   -8.786|2013-03-28 05:00:00| 0.0|-1.8|-1.7| -1.7|  5.4| 100|

In [23]:
working_df.where(F.col('dewpt') > F.col('temp')).groupBy('county').count().show()

+---------+-----+
|   county|count|
+---------+-----+
|    Clare|    1|
|Roscommon|   31|
|   Dublin|  103|
|   Galway|    5|
|     Cork|  106|
|Tipperary|   10|
|     Mayo|   54|
|    Meath|    7|
|   Carlow|   24|
|Westmeath|   20|
|    Sligo|    2|
|    Kerry|    2|
|    Cavan|    4|
+---------+-----+



You can see that it's possible to do comparisons between columns as well.

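It does not look like we can use the relationship between temperature and dew point to determine valid entries. Records with a dew point higher than the temperature occur in quite a few counties, and in the rows shown above the difference is only 0.1 at a relative humidity of 100%, which looks more like measurement rounding than a faulty logger. 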

#### \[Exercise\]

As an exercise, complete the data checks for the remainder of the expected values:

#### 5. `vappr` a positive decimal

#### 6. `rhum` between 0 and 100

#### 7. `msl` any positive decimal number close to 1013.25

#### 8. `wdsp` any positive decimal number below 220

#### 9. `wddir` between 0 and 360

#### 10. `sun` between 0 and 24

#### 11. `vis` any positive decimal number

#### 12. `clht` integer between 0 and 999

#### 13. `clamt` integer between 0 and 9

#### List of errors:

1. `wetb`: remove values where `wetb` is -49.0, -49.9, or -99.9
2. `dewpt`: remove values where `dewpt` is -92.4
3. `rhum`: remove values where `rhum` is -14
4. `clamt`: remove null values in the `clamt` field
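
One way to act on the suspicious readings is to replace them with nulls so that they are treated as missing rather than as real measurements. The plain-Python sketch below assumes that interpretation; dropping the whole record would be an equally valid choice. The values are the ones that showed up in the grouped outputs and summary statistics earlier in this section.

```python
# Suspected logger fault codes per field, as seen in the earlier outputs.
SENTINELS = {
    "wetb": {-49.0, -49.9, -99.9},
    "dewpt": {-92.4},
    "rhum": {-14},
}


def null_out_sentinels(record, sentinels=SENTINELS):
    """Return a copy of the record with suspected fault codes set to None."""
    cleaned = dict(record)
    for field, bad_values in sentinels.items():
        if cleaned.get(field) in bad_values:
            cleaned[field] = None
    return cleaned


faulty = {"station": "CASEMENT", "temp": 12.5, "wetb": -49.0, "dewpt": 11.1, "rhum": 92}
cleaned_record = null_out_sentinels(faulty)
print(cleaned_record)
```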

### 3. Structural errors

Lots of things can go wrong when working with large datasets that come from source systems over which you do not have full control (a common thing when working as a data engineer). 

Similar to spurious observations, upstream changes can lead to incorrect structures for our processing systems. While addressing structural errors usually happens at the `extraction and parsing` phase of the data engineering process, the diagnosis typically happens during this phase. 

Usually, structural errors are picked up as malformed columns or value types that are incorrect within the specific fields (for example, strings within numerical columns) or by manual inspection of string fields (for example, having a county name in the station field). 
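
A check for strings within numerical columns can be sketched in plain Python by testing whether each raw value parses as a number. The sample values below are hypothetical and only illustrate the idea; nulls are left out because they are a missing-data problem rather than a structural one.

```python
def is_number(text):
    """Return True if a raw string value can be parsed as a float."""
    try:
        float(text)
    except (TypeError, ValueError):
        return False
    return True


# Hypothetical raw values for a numerical field such as `temp`.
raw_temp_values = ["15.3", "14.7", " ", "Galway", None]

# Non-null values that cannot be parsed point to a structural error.
malformed = [value for value in raw_temp_values
             if value is not None and not is_number(value)]
print(malformed)
```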

To perform this check, we have to look at the dataset before Spark coerced any records into the structure/schema that we specified. To do that, we will refer back to the `raw_df`. Let's have a look again.

In [24]:
raw_df.show(5)

+------+-------+------------------+---------+-----------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|county|station|          latitude|longitude|             date|rain|temp|wetb|dewpt|vappr|rhum|   msl|wdsp|wddir| sun| vis|clht|clamt|
+------+-------+------------------+---------+-----------------+----+----+----+-----+-----+----+------+----+-----+----+----+----+-----+
|Galway|ATHENRY|53.288999999999994|   -8.786|26-jun-2011 01:00| 0.0|15.3|14.5| 13.9| 15.8|  90|1016.0|   8|  190|null|null|null| null|
|Galway|ATHENRY|53.288999999999994|   -8.786|26-jun-2011 02:00| 0.0|14.7|13.7| 12.9| 14.9|  89|1015.8|   7|  190|null|null|null| null|
|Galway|ATHENRY|53.288999999999994|   -8.786|26-jun-2011 03:00| 0.0|14.3|13.4| 12.6| 14.6|  89|1015.5|   6|  190|null|null|null| null|
|Galway|ATHENRY|53.288999999999994|   -8.786|26-jun-2011 04:00| 0.0|14.4|13.6| 12.8| 14.8|  90|1015.3|   7|  180|null|null|null| null|
|Galway|ATHENRY|53.288999999999994|   -8.786|26-jun-201

Let's have a look at the entries in the string columns:

In [25]:
raw_df.groupBy('county').count().show()

+---------+------+
|   county| count|
+---------+------+
|    Clare|266617|
|  Wexford|135192|
|Roscommon|108192|
|   Dublin|653378|
|  Donegal|460028|
|   Galway|216792|
|     Cork|802513|
|Tipperary|108144|
|     Mayo|877279|
|    Meath|108887|
|   Carlow|147576|
|Westmeath|266617|
|    Sligo|108888|
|    Kerry|266616|
|    Cavan|133704|
+---------+------+



All the values in the county field are valid county names. Let's look at the stations:

In [26]:
raw_df.groupBy('station').count().show(100)

+--------------------+------+
|             station| count|
+--------------------+------+
|       KNOCK AIRPORT|208938|
|         CLAREMORRIS|266580|
|            OAK PARK|147576|
|        CORK AIRPORT|266617|
|              FINNER|193411|
|             GURTEEN|108144|
|      DUBLIN AIRPORT|266617|
|           BELMULLET|266617|
|            CASEMENT|266617|
|     SHANNON AIRPORT|266617|
|             ATHENRY| 78312|
|          BALLYHAISE|133704|
|VALENTIA OBSERVATORY|266616|
|             DUNSANY|108887|
|        PHOENIX PARK|120144|
|           MACE HEAD|138480|
|        ROCHES POINT|247296|
|             NEWPORT|135144|
|         JOHNSTOWNII|135192|
|           MULLINGAR|266617|
|           MT DILLON|108192|
|       SherkinIsland|141024|
|          MALIN HEAD|266617|
|             MARKREE|108888|
|          MOORE PARK|147576|
+--------------------+------+



It all looks fine! All the values in the station field are valid station names.

When casting values, Spark (with ANSI mode disabled, that is, `spark.sql.ansi.enabled` set to `false`) does not fail or silently overflow when it encounters an incompatible value. Instead, it coerces the value to `NULL`. We can therefore find the values that failed to cast by casting and then counting the `NULL` values. 

For this, it is important to know how many null values there were before casting and then compare that with after casting. 

In the code cell below, we first count the NULL values in each column before casting to Integer or Float type. We then cast to the specific type and count the NULL values again on the resulting DataFrame. Finally, we subtract the before count from the after count to determine how many values failed to cast.

***Note:*** This code cell might take some time to execute based on your computing environment. 

In [27]:
# Create a dictionary to hold failures. 
failures = {}
# Start from the raw DataFrame. DataFrames are immutable, so this is a reference, not a copy.
test_df = raw_df

# Loop through columns which hold integer values.
for col in weather_int:
    # Count the number of nulls before type casting. The isnull() function checks if a value is NULL.
    # The when() function returns a value only for rows that satisfy the condition, and NULL otherwise.
    # The count() function then counts the non-NULL results, i.e. the rows where the column is NULL.
    # We name the resulting column using the alias() method.
    # Finally, we bring the result back to the driver using the collect() method.
    before_count = test_df.select(F.count(F.when(F.isnull(col), col)).alias(col)).collect() 
    
    # Cast to Integer type, creating a new column using the withColumn() method.
    # We use the cast() method to cast and use the IntegerType defined in Spark.
    test_df = test_df.withColumn(f'test_type_{col}', F.col(col).cast(IntegerType()))
    
    # Count the number of nulls after type casting, similar to the before count.
    after_count = test_df.select(F.count(F.when(F.isnull(f'test_type_{col}'), f'test_type_{col}')).alias(f'test_type_{col}')).collect() 
    
    # Get the difference between before and after. We need to extract the value from a list 
    # which the collect() method returns, and then get the first element in the row.
    failures[col] = after_count[0][0] - before_count[0][0] 
    
# Loop through columns which hold float values.
for col in weather_float:
    # Count the number of nulls before type casting, similar to the above counts.
    before_count = test_df.select(F.count(F.when(F.isnull(col), col)).alias(col)).collect() 
    
    # Cast to Float type, creating a new column using the withColumn() method.
    # We use the cast() method to cast and use the FloatType defined in Spark.
    test_df = test_df.withColumn(f'test_type_{col}', F.col(col).cast(FloatType())) 
    
    # Count the number of nulls after type casting, similar to the above counts.
    after_count = test_df.select(F.count(F.when(F.isnull(f'test_type_{col}'), f'test_type_{col}')).alias(f'test_type_{col}')).collect() 
    
    # Get the difference between before and after.
    failures[col] = after_count[0][0] - before_count[0][0] 

In the above code, we use the `collect()` method. This method gathers the results from the executor nodes and returns them to the driver node as a local Python list of `Row` objects. Because everything is held in the driver's memory, it should only be used on small results, such as these single-row aggregates.

In [28]:
failures

{'rhum': 163615,
 'wdsp': 81977,
 'wddir': 97336,
 'vis': 270249,
 'clht': 231378,
 'clamt': 231378,
 'latitude': 0,
 'longitude': 0,
 'rain': 111665,
 'temp': 32581,
 'wetb': 45288,
 'dewpt': 44159,
 'vappr': 187484,
 'msl': 73283,
 'sun': 231360}

Let's have a closer look at the values that are being cast incorrectly. 
Here we want to group by values that are NULL after the casting, and not NULL before the casting. That is exactly what we are doing below, and aggregating by count to see how many of each instance we have.

In [29]:
# For each of the continuous variable columns.
for col in weather_cont:
    # Again, we use the isnull() function to find values that are null.
    # We use the where() method to select only the records that match our conditions:
    # the cast column is null and the raw column is not null (the '~' operator signifies not).
    # As before, we aggregate using count and print each resulting DataFrame to standard out.
    test_df.where(F.isnull(test_df[f'test_type_{col}']) & ~F.isnull(raw_df[col])).groupby(col, f'test_type_{col}').count().show()

+--------+------------------+-----+
|latitude|test_type_latitude|count|
+--------+------------------+-----+
+--------+------------------+-----+

+---------+-------------------+-----+
|longitude|test_type_longitude|count|
+---------+-------------------+-----+
+---------+-------------------+-----+

+----+--------------+------+
|rain|test_type_rain| count|
+----+--------------+------+
|    |          null|111665|
+----+--------------+------+

+----+--------------+-----+
|temp|test_type_temp|count|
+----+--------------+-----+
|    |          null|32581|
+----+--------------+-----+

+----+--------------+-----+
|wetb|test_type_wetb|count|
+----+--------------+-----+
|    |          null|45288|
+----+--------------+-----+

+-----+---------------+-----+
|dewpt|test_type_dewpt|count|
+-----+---------------+-----+
|     |           null|44159|
+-----+---------------+-----+

+-----+---------------+------+
|vappr|test_type_vappr| count|
+-----+---------------+------+
|     |           null|187484|

Looks like we found the culprit. It's empty strings that are cast to nulls (those are the blank values in the first cell of each table). We can leave these fields as is, as they represent true missing values, and we will deal with them in the next section when we deal with missing data. 

Brilliant! It looks like, for the majority of cases, we do not have any issues in the structure of our dataset and its parsed version in Spark. We can move along.

### 4. Missing observations

Let's do a quick check for missing data before we dive into the next section.

The reason for bringing it up here is that missing data is a feature of the dataset that we have to deal with or justify. When we find missing data, we have three options for dealing with it:

- Filter out the missing data.
- Impute or supplement the field (discussed in the next train).
- Note the caveat and hand over the data to other analytical teams to deal with the missing data or apply more advanced imputation techniques.

Let's see how much missing data we have and how we would remove missing data.

First, we check for NULL values:

In [30]:
working_df.select([F.count(F.when(F.isnull(x), x)).alias(x) for x in raw_df.columns if x not in weather_cat]).show()

+--------+---------+------+-----+-----+-----+------+------+-----+------+------+-------+-------+-------+-------+
|latitude|longitude|  rain| temp| wetb|dewpt| vappr|  rhum|  msl|  wdsp| wddir|    sun|    vis|   clht|  clamt|
+--------+---------+------+-----+-----+-----+------+------+-----+------+------+-------+-------+-------+-------+
|       0|        0|111665|32581|45288|44159|187484|163615|73283|311009|326368|2816527|2855416|2816545|2816545|
+--------+---------+------+-----+-----+-----+------+------+-----+------+------+-------+-------+-------+-------+



We also have to check for NaN values:

In [31]:
working_df.select([F.count(F.when(F.isnan(x), x)).alias(x) for x in raw_df.columns if x not in weather_cat]).show()

+--------+---------+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
|latitude|longitude|rain|temp|wetb|dewpt|vappr|rhum|msl|wdsp|wddir|sun|vis|clht|clamt|
+--------+---------+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+
|       0|        0|   0|   0|   0|    0|    0|   0|  0|   0|    0|  0|  0|   0|    0|
+--------+---------+----+----+----+-----+-----+----+---+----+-----+---+---+----+-----+



Looks like we have quite a number of rows with missing data. Luckily, these are only NULL values and not NaNs. Spark treats the two differently: NULL marks a missing value, while NaN is a floating-point "not a number" value. We will cover how to handle missing values in the next train. 

## Conclusion

In this train, we looked at the first half of a data engineering task from a processing perspective. We ingested the data, created a master table to work from, and performed some filters and checks on the data. In the next train, we will continue the data engineering process by cleaning the data, performing imputation and additional enrichments, and producing the appropriate output.
