# COMP.CS.320 Data-Intensive Programming, Exercise 2

This exercise contains basic data processing tasks using Spark and DataFrames. The tasks can be done in either Scala or Python. This is the **Python** version; switch to the Scala version if you want to do the tasks in Scala.

Each task has its own cell for the code. Add your solutions to the cells. You are free to add more cells if you feel it is necessary. There are cells with example outputs or test code following most of the tasks.

Don't forget to submit your solutions to Moodle.

In [0]:
# some imports that might be required in the tasks

from typing import List
from pyspark.sql import functions
from pyspark.sql import DataFrame
from pyspark.sql import Row
from pyspark.sql import SparkSession
from pyspark.sql.functions import min, max, count, avg, round
from pyspark.sql.functions import col, year


## Task 1 - Create DataFrame

As mentioned in the tutorial notebook, Azure Storage Account and Azure Data Lake Storage Gen2 are used in the course to provide a place to read and write data files.
In the [Shared container](https://portal.azure.com/#view/Microsoft_Azure_Storage/ContainerMenuBlade/~/overview/storageAccountId/%2Fsubscriptions%2Fe0c78478-e7f8-429c-a25f-015eae9f54bb%2FresourceGroups%2Ftuni-cs320-f2023-rg%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2Ftunics320f2023gen2/path/shared/etag/%220x8DBB0695B02FFFE%22/defaultEncryptionScope/%24account-encryption-key/denyEncryptionScopeOverride~/false/defaultId//publicAccessVal/None), the `exercises/ex2` folder contains the file `rdu-weather-history.csv`, which holds weather data in CSV format.
The direct address for the data file is: `abfss://shared@tunics320f2023gen2.dfs.core.windows.net/exercises/ex2/rdu-weather-history.csv`

Read the data from the CSV file into a DataFrame called `weatherDataFrame`. Let Spark infer the schema for the data.

Print out the schema.
Study the schema and compare it to the data in the CSV file. Do they match?


In [0]:
from pyspark.sql import SparkSession, DataFrame
file_path = "abfss://shared@tunics320f2023gen2.dfs.core.windows.net/exercises/ex2/rdu-weather-history.csv"
weatherDataFrame: DataFrame = spark.read.csv(file_path, header=True, inferSchema=True)

# code that prints out the schema for weatherDataFrame
weatherDataFrame.printSchema()



root
 |-- date: date (nullable = true)
 |-- temperaturemin: double (nullable = true)
 |-- temperaturemax: double (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- snowfall: double (nullable = true)
 |-- snowdepth: double (nullable = true)
 |-- avgwindspeed: double (nullable = true)
 |-- fastest2minwinddir: integer (nullable = true)
 |-- fastest2minwindspeed: double (nullable = true)
 |-- fastest5secwinddir: integer (nullable = true)
 |-- fastest5secNo: double (nullable = true)
 |-- windspeed: string (nullable = true)
 |-- fog: string (nullable = true)
 |-- fogheavy: string (nullable = true)
 |-- mist: string (nullable = true)
 |-- rain: string (nullable = true)
 |-- fogground: string (nullable = true)
 |-- ice: string (nullable = true)
 |-- glaze: string (nullable = true)
 |-- drizzle: string (nullable = true)
 |-- snow: string (nullable = true)
 |-- freezingrain: string (nullable = true)
 |-- smokehaze: string (nullable = true)
 |-- thunder: string (nullable = true)


Example output for task 1 (only the first few lines):

```text
root
 |-- date: date (nullable = true)
 |-- temperaturemin: double (nullable = true)
 |-- temperaturemax: double (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- snowfall: double (nullable = true)
 |-- snowdepth: double (nullable = true)
 |-- avgwindspeed: double (nullable = true)
 ...
```

## Task 2 - The first items from DataFrame

Fetch the first **five** rows of the weather dataframe and print their contents. You can use the DataFrame variable from task 1.

In [0]:
weatherSample: List[Row] = weatherDataFrame.head(5)

print(*[list(row.asDict().values()) for row in weatherSample], sep="\n")  # prints each Row on its own line


[datetime.date(2008, 5, 20), 57.9, 82.9, 0.43, 0.0, 0.0, 10.51, 230, 25.05, 220, 31.99, 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No']
[datetime.date(2008, 5, 22), 48.0, 78.1, 0.0, 0.0, 0.0, 4.03, 230, 16.11, 280, 21.03, 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 5, 23), 52.0, 79.0, 0.0, 0.0, 0.0, 4.7, 70, 10.07, 100, 14.99, 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 6, 7), 73.9, 100.0, 0.0, 0.0, 0.0, 5.59, 230, 16.11, 220, 21.92, 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 6, 22), 64.9, 87.1, 0.93, 0.0, 0.0, 6.93, 200, 23.04, 200, 29.97, 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No']


Example output for task 2:

```text
[datetime.date(2008, 5, 20), 57.9, 82.9, 0.43, 0.0, 0.0, 10.51, 230, 25.05, 220, 31.99, 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'No']
[datetime.date(2008, 5, 22), 48.0, 78.1, 0.0, 0.0, 0.0, 4.03, 230, 16.11, 280, 21.03, 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 5, 23), 52.0, 79.0, 0.0, 0.0, 0.0, 4.7, 70, 10.07, 100, 14.99, 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 6, 7), 73.9, 100.0, 0.0, 0.0, 0.0, 5.59, 230, 16.11, 220, 21.92, 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No']
[datetime.date(2008, 6, 22), 64.9, 87.1, 0.93, 0.0, 0.0, 6.93, 200, 23.04, 200, 29.97, 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No']
```

## Task 3 - Minimum and maximum

Find the minimum temperature and the maximum temperature in the whole dataset.

In [0]:
# compute both extremes in a single pass over the data
extremes: Row = weatherDataFrame.agg(min("temperaturemin"), max("temperaturemax")).first()

minTemp: float = extremes[0]
maxTemp: float = extremes[1]

print(f"Min temperature is {minTemp}")
print(f"Max temperature is {maxTemp}")


Min temperature is 4.1
Max temperature is 105.1


In [0]:
if 4.05 < minTemp < 4.15:
    print("correct result: minimum temperature is 4.1 °F (-15.5 °C)")
else:
    print(f"wrong result: {minTemp} != 4.1")

if 105.05 < maxTemp < 105.15:
    print("correct result: maximum temperature is 105.1 °F (40.6 °C)")
else:
    print(f"wrong result: {maxTemp} != 105.1")


correct result: minimum temperature is 4.1 °F (-15.5 °C)
correct result: maximum temperature is 105.1 °F (40.6 °C)
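
The Celsius values quoted in the test output follow from the standard Fahrenheit-to-Celsius formula, C = (F - 32) × 5/9. The sketch below reproduces them in plain Python; the helper name `fahrenheit_to_celsius` is illustrative and not part of the exercise.

```python
def fahrenheit_to_celsius(temp_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# the two extremes reported by the task 3 test cell
for temp_f in (4.1, 105.1):
    print(f"{temp_f} °F = {fahrenheit_to_celsius(temp_f):.1f} °C")
# 4.1 °F = -15.5 °C
# 105.1 °F = 40.6 °C
```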


## Task 4 - Adding a column

Add a new column `year` to the weatherDataFrame and print out the schema for the new DataFrame.

The new column should be of integer type, and its value should be calculated from the `date` column.
You can use the function `functions.year` from `pyspark.sql`.

See documentation: [https://spark.apache.org/docs/3.4.1/api/python/reference/pyspark.sql/api/pyspark.sql.functions.year.html#pyspark.sql.functions.year](https://spark.apache.org/docs/3.4.1/api/python/reference/pyspark.sql/api/pyspark.sql.functions.year.html#pyspark.sql.functions.year)


In [0]:
from pyspark.sql.functions import year

weatherDataFrameWithYear: DataFrame = weatherDataFrame.withColumn("year", year(weatherDataFrame["date"]))

# code that prints out the schema for weatherDataFrameWithYear
weatherDataFrameWithYear.printSchema()

root
 |-- date: date (nullable = true)
 |-- temperaturemin: double (nullable = true)
 |-- temperaturemax: double (nullable = true)
 |-- precipitation: double (nullable = true)
 |-- snowfall: double (nullable = true)
 |-- snowdepth: double (nullable = true)
 |-- avgwindspeed: double (nullable = true)
 |-- fastest2minwinddir: integer (nullable = true)
 |-- fastest2minwindspeed: double (nullable = true)
 |-- fastest5secwinddir: integer (nullable = true)
 |-- fastest5secNo: double (nullable = true)
 |-- windspeed: string (nullable = true)
 |-- fog: string (nullable = true)
 |-- fogheavy: string (nullable = true)
 |-- mist: string (nullable = true)
 |-- rain: string (nullable = true)
 |-- fogground: string (nullable = true)
 |-- ice: string (nullable = true)
 |-- glaze: string (nullable = true)
 |-- drizzle: string (nullable = true)
 |-- snow: string (nullable = true)
 |-- freezingrain: string (nullable = true)
 |-- smokehaze: string (nullable = true)
 |-- thunder: string (nullable = true)


Example output for task 4 (only the last few lines):

```text
...
 |-- highwind: string (nullable = true)
 |-- hail: string (nullable = true)
 |-- blowingsnow: string (nullable = true)
 |-- dust: string (nullable = true)
 |-- year: integer (nullable = true)
```


## Task 5 - Aggregated DataFrame 1

Find the minimum and the maximum temperature for each year.

Sort the resulting DataFrame based on year so that the latest year is in the first row and the earliest year is in the last row.


In [0]:
aggregatedDF: DataFrame = (
    weatherDataFrameWithYear
    .groupBy("year")
    .agg(
        min("temperaturemin").alias("min_temperature"),
        max("temperaturemax").alias("max_temperature"),
    )
    .orderBy(col("year").desc())  # latest year first
)
aggregatedDF.show()


+----+---------------+---------------+
|year|min_temperature|max_temperature|
+----+---------------+---------------+
|2018|            4.1|           98.1|
|2017|            9.1|          102.0|
|2016|           15.3|           99.0|
|2015|            7.2|          100.0|
|2014|            7.2|           98.1|
|2013|           18.0|           96.1|
|2012|           19.0|          105.1|
|2011|           16.0|          104.0|
|2010|           15.1|          102.0|
|2009|           10.9|           99.0|
|2008|           15.1|          100.9|
|2007|           15.1|          105.1|
+----+---------------+---------------+



Example output for task 5:

```text
+----+---------------+---------------+
|year|min_temperature|max_temperature|
+----+---------------+---------------+
|2018|            4.1|           98.1|
|2017|            9.1|          102.0|
|2016|           15.3|           99.0|
|2015|            7.2|          100.0|
|2014|            7.2|           98.1|
|2013|           18.0|           96.1|
|2012|           19.0|          105.1|
|2011|           16.0|          104.0|
|2010|           15.1|          102.0|
|2009|           10.9|           99.0|
|2008|           15.1|          100.9|
|2007|           15.1|          105.1|
+----+---------------+---------------+
```


## Task 6 - Aggregated DataFrame 2

Expanding from task 5, create a DataFrame that contains the following for each year:

- the minimum temperature
- the maximum temperature
- the number of entries (rows in the original data) for that year
- the average wind speed (rounded to two decimal places)


In [0]:
task6DF: DataFrame = weatherDataFrameWithYear.groupBy("year").agg(
    min("temperaturemin").alias("min_temperature"),
    max("temperaturemax").alias("max_temperature"),
    count("*").alias("entries"),
    round(avg("avgwindspeed"), 2).alias("avg_windspeed"),
)

task6DF.show()


+----+---------------+---------------+-------+-------------+
|year|min_temperature|max_temperature|entries|avg_windspeed|
+----+---------------+---------------+-------+-------------+
|2007|           15.1|          105.1|    365|         6.14|
|2018|            4.1|           98.1|    228|         6.55|
|2015|            7.2|          100.0|    365|         5.44|
|2013|           18.0|           96.1|    365|         5.51|
|2014|            7.2|           98.1|    365|         5.56|
|2012|           19.0|          105.1|    366|         5.41|
|2009|           10.9|           99.0|    365|         6.13|
|2016|           15.3|           99.0|    366|         5.78|
|2010|           15.1|          102.0|    365|         5.49|
|2011|           16.0|          104.0|    365|         5.84|
|2008|           15.1|          100.9|    366|         6.49|
|2017|            9.1|          102.0|    365|         6.25|
+----+---------------+---------------+-------+-------------+



Example output for task 6:

```text
+----+---------------+---------------+-------+-------------+
|year|min_temperature|max_temperature|entries|avg_windspeed|
+----+---------------+---------------+-------+-------------+
|2007|           15.1|          105.1|    365|         6.14|
|2018|            4.1|           98.1|    228|         6.55|
|2015|            7.2|          100.0|    365|         5.44|
|2013|           18.0|           96.1|    365|         5.51|
|2014|            7.2|           98.1|    365|         5.56|
|2012|           19.0|          105.1|    366|         5.41|
|2009|           10.9|           99.0|    365|         6.13|
|2016|           15.3|           99.0|    366|         5.78|
|2010|           15.1|          102.0|    365|         5.49|
|2011|           16.0|          104.0|    365|         5.84|
|2008|           15.1|          100.9|    366|         6.49|
|2017|            9.1|          102.0|    365|         6.25|
+----+---------------+---------------+-------+-------------+
```
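As a quick sanity check on the `entries` column, the counts can be compared with the number of calendar days in each year. This is only a sketch: it copies the counts from the example output above and assumes the data has at most one row per day. The helper `days_in_year` is illustrative.

```python
import calendar

# per-year row counts copied from the task 6 example output
entries_per_year = {
    2007: 365, 2008: 366, 2009: 365, 2010: 365, 2011: 365, 2012: 366,
    2013: 365, 2014: 365, 2015: 365, 2016: 366, 2017: 365, 2018: 228,
}


def days_in_year(year: int) -> int:
    """Return the number of calendar days in the given year."""
    return 366 if calendar.isleap(year) else 365


# years whose row count differs from the calendar, with the number of missing days
incomplete_years = {
    year: days_in_year(year) - entries
    for year, entries in entries_per_year.items()
    if entries != days_in_year(year)
}
print(incomplete_years)  # {2018: 137}
```

Under that assumption, every year except 2018 is fully covered, so the 2018 aggregates describe only part of the year.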


## Task 7 - Aggregated DataFrame 3

Using the DataFrame created in task 6, `task6DF`, find the following values:

- the minimum temperature for year 2012
- the maximum temperature for year 2016
- the number of entries for year 2018
- the average wind speed for year 2008


In [0]:
# look up each value from the per-year aggregates computed in task 6
min2012: float = task6DF.filter(col("year") == 2012).select("min_temperature").first()[0]
max2016: float = task6DF.filter(col("year") == 2016).select("max_temperature").first()[0]
entries2018: int = task6DF.filter(col("year") == 2018).select("entries").first()[0]
wind2008: float = task6DF.filter(col("year") == 2008).select("avg_windspeed").first()[0]


In [0]:
if 18.95 < min2012 < 19.05:
    print("correct result: minimum temperature for year 2012 is 19.0 °F")
else:
    print(f"wrong result: {min2012} != 19.0")

if 98.95 < max2016 < 99.05:
    print("correct result: maximum temperature for year 2016 is 99.0 °F")
else:
    print(f"wrong result: {max2016} != 99.0")

if entries2018 == 228:
    print("correct result: there are 228 entries for year 2018")
else:
    print(f"wrong result: {entries2018} != 228")

if 6.485 < wind2008 < 6.495:
    print("correct result: average wind speed for year 2008 is 6.49")
else:
    print(f"wrong result: {wind2008} != 6.49")


correct result: minimum temperature for year 2012 is 19.0 °F
correct result: maximum temperature for year 2016 is 99.0 °F
correct result: there are 228 entries for year 2018
correct result: average wind speed for year 2008 is 6.49
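
The test cells compare floating-point results against a narrow window rather than with `==`, because values that come out of arithmetic are rarely exactly equal to a decimal literal. A roughly equivalent check can be sketched with `math.isclose` from the standard library. The helper name `matches_expected` and its default tolerance are illustrative, and unlike the strict `<` comparisons above, `math.isclose` also accepts a value exactly on the boundary.

```python
import math


def matches_expected(actual: float, expected: float, tolerance: float = 0.05) -> bool:
    """Return True if actual is within an absolute tolerance of expected."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=tolerance)


print(matches_expected(18.999999, 19.0))  # True
print(matches_expected(19.1, 19.0))  # False
print(matches_expected(6.49, 6.49, tolerance=0.005))  # True
```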


## Task 8 - One additional aggregated DataFrame

Find the year that has the highest number of days that had fog.

Note: days that have been marked as `fogheavy` days but not as `fog` days should not be counted.


In [0]:
# check the value distribution of the fog and fogheavy columns

fog_counts = weatherDataFrame.groupBy("fog").count().orderBy("count", ascending=False)
print("FOG data")
fog_counts.show()

fogheavy_counts = weatherDataFrame.groupBy("fogheavy").count().orderBy("count", ascending=False)
print("Heavy FOG data")
fogheavy_counts.show()

FOG data
+---+-----+
|fog|count|
+---+-----+
| No| 4011|
|Yes|  235|
+---+-----+

Heavy FOG data
+--------+-----+
|fogheavy|count|
+--------+-----+
|      No| 3464|
|     Yes|  782|
+--------+-----+



In [0]:
task8: DataFrame = weatherDataFrameWithYear

# keep the days marked as fog but not as heavy fog
fogDays = task8.filter((task8["fog"] == "Yes") & (task8["fogheavy"] == "No"))

# count the fog days per year and pick the year with the highest count
fogByYear = fogDays.groupBy("year").agg(count("*").alias("fogCount"))
maxDays = fogByYear.orderBy(fogByYear["fogCount"].desc()).first()

yearWithMostDaysWithFog: int = maxDays["year"]



In [0]:
if yearWithMostDaysWithFog == 2015:
    print("correct result: year 2015 had the highest number of days with fog (32)")
else:
    print(f"wrong result: {yearWithMostDaysWithFog} != 2015")


correct result: year 2015 had the highest number of days with fog (32)
