The objective of this project is to walk through an ETL process with PySpark and build a simple data warehouse for tracking food item prices in 21st-century Poland. The dataset can be downloaded for free from the FAOSTAT website: https://www.fao.org/faostat/en/#data/PP.

In [10]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FoodPrices").getOrCreate()

After creating a Spark session, let's have a look at the CSV file.

In [11]:
df = spark.read.csv('data/Prices_E_Europe_NOFLAG.csv', header=True, inferSchema=True)
df.show()

AnalysisException: [PATH_NOT_FOUND] Path does not exist: file:/c:/proggrind/FoodPrices/data/data/Prices_E_Europe_NOFLAG.csv. SQLSTATE: 42K03
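
The error above suggests that the relative path was resolved against a working directory that was already `data/`, which doubled the folder name. Below is a minimal sketch of one way to guard against this. `find_csv` is a hypothetical helper (not part of the notebook), and the directory names it searches are assumptions about the project layout.

```python
from pathlib import Path


def find_csv(filename, search_dirs=(".", "data")):
    """Return the first existing path to `filename` among `search_dirs`.

    Hypothetical helper: the default directory names are assumptions
    about the project layout, not something the dataset prescribes.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate.resolve()
    # Fail early with every location that was tried, as absolute paths
    tried = ", ".join(str(Path(d).resolve()) for d in search_dirs)
    raise FileNotFoundError(f"{filename} not found in: {tried}")
```

The resulting path could then be passed to the reader as `spark.read.csv(str(find_csv('Prices_E_Europe_NOFLAG.csv')), header=True, inferSchema=True)`, so the cell behaves the same whether the notebook is started from the project root or from `data/`.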

As a result of the ETL process, we want monthly food prices in Poland for the 21st century.
Let's start by filtering our dataframe.

In [None]:
from pyspark.sql.functions import col

df = df.filter(col('Area') == 'Poland')
df = df.filter(col('Months') != 'Annual value')

df = df.filter(col('Unit') == 'LCU') # only records with prices in the local currency -> PLN

year_columns = [c for c in df.columns 
                 if c.startswith("Y20") or c.startswith("Y21")] # don't take years before 2000

df = df.select(*["Item Code", "Item", "Months"], *year_columns)
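
The prefix test above relies on every column that starts with `Y20` or `Y21` being a year column. A slightly stricter variant, sketched here under the assumption that year columns are named `Y` followed by exactly four digits, parses the year and compares it numerically. `year_columns_from` and the example column list are illustrative only, and `Y2010F` is a made-up name for a non-year column.

```python
import re


def year_columns_from(columns, first_year=2000):
    """Keep columns named like 'Y2010' whose year is >= first_year."""
    selected = []
    for name in columns:
        match = re.fullmatch(r"Y(\d{4})", name)
        if match and int(match.group(1)) >= first_year:
            selected.append(name)
    return selected


# Illustrative column list; 'Y2010F' is a made-up non-year column
example_columns = ["Area", "Item", "Months", "Y1999", "Y2000", "Y2010", "Y2010F"]
selected = year_columns_from(example_columns)
print(selected)
```

In the notebook this would be called as `year_columns_from(df.columns)`.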

We keep only the columns needed for further analysis (for example, the area columns are dropped, since after filtering they would all hold the same value).
Let's look at the first 5 rows of the result.

In [None]:
df.head(5)

[Row(Item Code=44, Item='Barley', Months='January', Y2000=None, Y2001=None, Y2002=None, Y2003=None, Y2004=None, Y2005=None, Y2006=None, Y2007=None, Y2008=None, Y2009=None, Y2010=429.0, Y2011=778.0, Y2012=789.0, Y2013=885.0, Y2014=772.0, Y2015=632.0, Y2016=633.0, Y2017=600.0, Y2018=676.0, Y2019=818.0, Y2020=711.0, Y2021=717.0, Y2022=1104.0, Y2023=1258.0, Y2024=None),
 Row(Item Code=44, Item='Barley', Months='February', Y2000=None, Y2001=None, Y2002=None, Y2003=None, Y2004=None, Y2005=None, Y2006=None, Y2007=None, Y2008=None, Y2009=None, Y2010=418.0, Y2011=756.0, Y2012=806.0, Y2013=862.0, Y2014=734.0, Y2015=619.0, Y2016=611.0, Y2017=603.0, Y2018=673.0, Y2019=832.0, Y2020=676.0, Y2021=759.0, Y2022=1116.0, Y2023=1154.0, Y2024=None),
 Row(Item Code=44, Item='Barley', Months='March', Y2000=None, Y2001=None, Y2002=None, Y2003=None, Y2004=None, Y2005=None, Y2006=None, Y2007=None, Y2008=None, Y2009=None, Y2010=415.0, Y2011=806.0, Y2012=816.0, Y2013=841.0, Y2014=735.0, Y2015=606.0, Y2016=598.0, 

With this dataframe in hand, let's reduce the number of columns by unpivoting: each year column becomes its own row for the corresponding month. This changes the table's orientation from wide to long.

In [None]:
# stack(n, label1, col1, label2, col2, ...) turns n (label, value) pairs into n rows
pivot_expression = "stack({0}, {1}) as (Year, Price)".format(
    len(year_columns),
    ", ".join([f"'{y[1:]}', `{y}`" for y in year_columns])  # strip the leading 'Y' from each year column name
)

df = df.selectExpr("*", pivot_expression).drop(*year_columns) 
df.head(5)

[Row(Item Code=44, Item='Barley', Months='January', Year='2000', Price=None),
 Row(Item Code=44, Item='Barley', Months='January', Year='2001', Price=None),
 Row(Item Code=44, Item='Barley', Months='January', Year='2002', Price=None),
 Row(Item Code=44, Item='Barley', Months='January', Year='2003', Price=None),
 Row(Item Code=44, Item='Barley', Months='January', Year='2004', Price=None)]
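
To see what is actually handed to `selectExpr`, the expression can be built for a shortened, illustrative list of two year columns. This is plain string formatting, so it runs without a Spark session.

```python
year_columns = ["Y2000", "Y2001"]  # shortened, illustrative list

unpivot_expression = "stack({0}, {1}) as (Year, Price)".format(
    len(year_columns),
    # each pair is: a string literal for the year, then the backticked column name
    ", ".join(f"'{y[1:]}', `{y}`" for y in year_columns),
)
print(unpivot_expression)
# stack(2, '2000', `Y2000`, '2001', `Y2001`) as (Year, Price)
```

The year labels are string literals, which is why `Year` shows up as a string (`'2000'`) in the rows above.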

After the transformation, some rows have 'Price'=None. Since price is the measure our whole analysis is built on, we drop those rows: a record without a measure carries no information.

In [None]:
df = df.filter(col("Price").isNotNull())
df.head(5)

[Row(Item Code=44, Item='Barley', Months='January', Year='2010', Price=429.0),
 Row(Item Code=44, Item='Barley', Months='January', Year='2011', Price=778.0),
 Row(Item Code=44, Item='Barley', Months='January', Year='2012', Price=789.0),
 Row(Item Code=44, Item='Barley', Months='January', Year='2013', Price=885.0),
 Row(Item Code=44, Item='Barley', Months='January', Year='2014', Price=772.0)]

Now that the data is prepared, we can model it for a simple data warehouse.
Our fact table will hold the price values, while the dimensions will be time (date) and item.

Let's start with the dimension tables.

In [None]:
dim_item = df.select("Item Code", "Item").distinct() \
    .withColumnRenamed("Item Code", "id") \
    .withColumnRenamed("Item", "name")

dim_item.head(5)

[Row(id=15, name='Wheat'),
 Row(id=71, name='Rye'),
 Row(id=882, name='Raw milk of cattle'),
 Row(id=1145, name='Meat of rabbits and hares, fresh or chilled (biological)'),
 Row(id=187, name='Peas, dry')]

In [None]:
# it is more convenient to have the month number alongside the month name
months_dict = {
    "January": 1, "February": 2, "March": 3, "April": 4,
    "May": 5, "June": 6, "July": 7, "August": 8,
    "September": 9, "October": 10, "November": 11, "December": 12
}

df_months = spark.createDataFrame(months_dict.items(), ["month_name", "month_num"])

dim_date = df.select("Year", "Months").distinct() \
    .join(df_months, df.Months == df_months.month_name, "left") \
    .withColumnRenamed("Year", "year") \
    .drop("Months") # avoid redundancy: month_name already holds the same value

dim_date.head(5)

[Row(year='2011', month_name='August', month_num=8),
 Row(year='2014', month_name='December', month_num=12),
 Row(year='2023', month_name='September', month_num=9),
 Row(year='2011', month_name='February', month_num=2),
 Row(year='2010', month_name='May', month_num=5)]
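
The month lookup could also be derived from the standard library instead of being typed by hand. This is a small sketch, assuming the default (English) locale, since `calendar.month_name` follows the current locale.

```python
import calendar

# calendar.month_name[0] is an empty string, so the months start at index 1
months_dict = {calendar.month_name[i]: i for i in range(1, 13)}
print(months_dict)
```

The resulting dictionary can be passed to `spark.createDataFrame` exactly as the handwritten one is.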

Let's also add the corresponding quarter to each date.

In [None]:
from pyspark.sql import functions as F

dim_date = dim_date.withColumn(
    "quarter",
    F.when(F.col("month_num").between(1, 3), "Q1")
     .when(F.col("month_num").between(4, 6), "Q2")
     .when(F.col("month_num").between(7, 9), "Q3")
     .otherwise("Q4")
)

dim_date.head(5)

[Row(year='2011', month_name='August', month_num=8, quarter='Q3'),
 Row(year='2014', month_name='December', month_num=12, quarter='Q4'),
 Row(year='2023', month_name='September', month_num=9, quarter='Q3'),
 Row(year='2011', month_name='February', month_num=2, quarter='Q1'),
 Row(year='2010', month_name='May', month_num=5, quarter='Q2')]
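
The quarter can also be computed arithmetically from the month number. The plain-Python sketch below shows the formula and rejects invalid input. `quarter_label` is an illustrative name. Note that in the `when` chain above, anything that falls through to `otherwise` is labeled Q4, including a null `month_num` from a month name that failed to match in the left join.

```python
def quarter_label(month_num):
    """Map a month number (1-12) to 'Q1'..'Q4'."""
    if not 1 <= month_num <= 12:
        raise ValueError(f"month_num must be in 1..12, got {month_num}")
    # months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    return f"Q{(month_num - 1) // 3 + 1}"


quarters = {m: quarter_label(m) for m in range(1, 13)}
print(quarters)
```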

For faster joins, every table should have a unique, single-column id.
The item dimension already satisfies this, since every food item has its own unique code.
For the date dimension table we will create an artificial (surrogate) one.

In [None]:
from pyspark.sql.functions import monotonically_increasing_id

dim_date = dim_date.withColumn("id", monotonically_increasing_id())
dim_date.head(5)

[Row(year='2011', month_name='August', month_num=8, quarter='Q3', id=0),
 Row(year='2014', month_name='December', month_num=12, quarter='Q4', id=1),
 Row(year='2023', month_name='September', month_num=9, quarter='Q3', id=2),
 Row(year='2011', month_name='February', month_num=2, quarter='Q1', id=3),
 Row(year='2010', month_name='May', month_num=5, quarter='Q2', id=4)]
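
The ids produced by `monotonically_increasing_id` are unique and increasing, but not guaranteed to be consecutive. According to the PySpark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below only illustrates that layout in plain Python. `spark_style_id` is a hypothetical name, not Spark's own code.

```python
def spark_style_id(partition_index, row_in_partition):
    """Mimic the documented layout: partition ID in the upper 31 bits,
    record number within the partition in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each
ids = [spark_style_id(p, r) for p in range(2) for r in range(3)]
print(ids)
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The values also depend on how the data is partitioned, so they may differ between runs. The fact table should therefore be built from the same `dim_date` dataframe rather than from hard-coded id values.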

Now we can create our fact table.
It is crucial to connect it to our dimension tables based on the ids.

In [None]:
f = df.select(
    col("Item Code").alias("item_id"),
    col("year"),
    col("Months"),
    col("Price").alias("price_pln")
).alias("f")

d = dim_date.alias("d")
i = dim_item.alias("i")

fact_prices = f.join(
        d,
        (col("f.year") == d.year) &
        (col("f.Months") == d.month_name),
        "left"
    ).join(
        i,
        (col("f.item_id") == i.id)
    ).select(
        col("f.item_id"), 
        col("d.id").alias("date_id"),
        col("price_pln")
    )

fact_prices.show()

+-------+-------+---------+
|item_id|date_id|price_pln|
+-------+-------+---------+
|     44|     98|    756.0|
|     44|    140|    603.0|
|   1121|     98|   6843.0|
|   1121|    140|   7510.0|
|     56|     98|    975.0|
|     56|    140|    510.0|
|    945|     98|   5875.0|
|    945|    140|   5750.0|
|   1095|     98|   3859.0|
|   1095|    140|   3460.0|
|   1056|     98|   4758.0|
|   1056|    140|   4160.0|
|   1145|     98|   5784.0|
|   1145|    140|   8110.0|
|   1013|     98|   7164.0|
|   1013|    140|   8800.0|
|     75|     98|    603.0|
|     75|    140|    454.0|
|    116|     98|    393.0|
|    116|    140|    357.0|
+-------+-------+---------+
only showing top 20 rows


Now we can look at our DW tables:

In [5]:
print("FACT TABLE: food prices")
fact_prices.show()

print("DIMENSION TABLE: food items")
dim_item.show()

print("DIMENSION TABLE: dates")
dim_date.show()

FACT TABLE: food prices


NameError: name 'fact_prices' is not defined

With this DW in place, we can start writing analytical queries to gain insight into the food market.

Let's start by finding the product that was, on average, the most expensive in the last year.

In [4]:
# Extract the ids of the dates that fall in 2024 (year is stored as a string)
dates2024 = dim_date.filter(col('year') == '2024').select('id')
dates2024.show()

NameError: name 'dim_date' is not defined