# Exercise
Read the file: `product_prices.csv` (separator ';', decimal separator '.') and do the following exercises:

1. Using the `columns` attribute, rename the columns to: `'province', 'product_types', 'currency', 'group_id', 'product_line', 'value', 'date'`<br>
1. Determine the following descriptive statistics: mean, standard deviation, and the percentiles 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.<br>Do the exercise **in two versions**: using dedicated functions and `describe`.
1. Do you agree with this way of generating these values? Why?
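
The first two tasks could be approached roughly as follows. This is only a sketch: the small DataFrame is a hypothetical stand-in for the loaded file (same original column names, made-up values), and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the loaded file: same column names, made-up values.
df = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D", "E"],
        "Goods types": [None, None, None, None, None],
        "Measurement unit": ["PLN"] * 5,
        "Group ID": [2, 4, 2, 2, 3],
        "Product types": ["x", "y", "z", "x", "y"],
        "Value": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Date": ["2013-3", "2018-2", "2019-12", "2019-2", "2002-3"],
    }
)

# Task 1: `columns` is an attribute; assigning a list renames all columns at once.
df.columns = [
    "province", "product_types", "currency", "group_id",
    "product_line", "value", "date",
]

# Task 2, version A: dedicated functions.
percentiles = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
mean_value = df["value"].mean()
std_value = df["value"].std()  # sample standard deviation (ddof=1)
quantiles = df["value"].quantile(percentiles)

# Task 2, version B: describe() with an explicit list of percentiles.
summary = df["value"].describe(percentiles=percentiles)

print(mean_value, std_value)
print(quantiles)
print(summary)
```

Both versions report the same mean, standard deviation and percentiles; `describe` additionally returns the count, minimum and maximum.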


In [3]:
import pandas as pd

df_raw = pd.read_csv(
    "../01_Data/product_prices.csv",  # path to the file with data (if we want to enter the name: filepath_or_buffer)
    sep=";",  # column separator
    decimal=",",  # sign separating the whole and fractional parts of a number
)

df_raw.head()  # display the first few rows and check if the data actually got loaded (the function itself will be discussed later)

Unnamed: 0,Name,Goods types,Measurement unit,Group ID,Product types,Value,Date
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3


In [17]:
df_raw["Value"].unique()  # check unique values in the "Value" column

array(['21.37', nan, '3.55', ..., '29.24', '31.39', '30.47'],
      shape=(3752,), dtype=object)
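
The `dtype=object` in this output indicates that the prices were read as strings rather than numbers. A likely cause is that `decimal=","` in the `read_csv` call does not match the `.` used in the data. The sketch below reproduces the effect on a hypothetical three-row sample (the sample text and variable names are made up for illustration) and shows one way to convert such strings afterwards.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the file: ';' separator, '.' as decimal point.
sample = "Name;Value\nA;21.37\nB;\nC;3.55\n"

# decimal="," does not match the data, so "21.37" is not parsed as a number.
df_comma = pd.read_csv(io.StringIO(sample), sep=";", decimal=",")

# decimal="." (the default) matches the data and yields a float column.
df_dot = pd.read_csv(io.StringIO(sample), sep=";", decimal=".")

print(df_comma["Value"].dtype)
print(df_dot["Value"].dtype)

# Strings that were already loaded can still be converted after the fact;
# anything that cannot be parsed becomes NaN.
converted = pd.to_numeric(df_comma["Value"], errors="coerce")
print(converted)
```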

## Comment regarding the correctness of the exercise
Summarizing the dataset this way is not entirely correct, because all product groups (preserves, meat, oat and other products) are treated as comparable. Moreover, the data form a time series, yet observations from all dates are pooled as if they were interchangeable.

This is commonly known as comparing apples to oranges. At a minimum, these statistics should be computed separately for each product group, and ideally broken down further by quarter of the year.
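
One way to compute the statistics per group and per quarter is `groupby` followed by `describe`. The sketch below uses a hypothetical miniature table with made-up values; it assumes columns named `Group ID`, `Quarter` and `Value`, and a shortened percentile list for readability.

```python
import pandas as pd

# Hypothetical miniature of the cleaned table; values are made up.
df = pd.DataFrame(
    {
        "Group ID": [2, 2, 2, 2, 4, 4, 4, 4],
        "Quarter": [1, 1, 4, 4, 1, 1, 2, 2],
        "Value": [20.0, 22.0, 6.0, 8.0, 3.0, 5.0, 2.0, 4.0],
    }
)

percentiles = [0.1, 0.5, 0.9]

# One row of summary statistics per (product group, quarter) pair.
grouped_summary = (
    df.groupby(["Group ID", "Quarter"])["Value"].describe(percentiles=percentiles)
)

# For comparison: the pooled mean hides the difference between the groups.
pooled_mean = df["Value"].mean()
group_means = df.groupby("Group ID")["Value"].mean()

print(grouped_summary)
print(pooled_mean)
print(group_means)
```

In this toy data the pooled mean lies between the two group means and describes neither group well, which is the problem the comment above points out.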

In [38]:
# change types of the df_raw columns
df_typed = df_raw.astype(
    {
        "Name": "string",
        "Goods types": "string",
        "Measurement unit": "string",
        "Group ID": "Int64",
        "Product types": "string",
        "Value": "float",
    }
)

# replace missing values in the Value column with 0
# (caution: these zeros then count as real prices in the mean and percentiles)
df_typed["Value"] = df_typed["Value"].fillna(0)

# simple split (assuming all values have the format "YYYY-M" or "YYYY-MM")
df_typed[["Year", "Month"]] = (
    df_typed["Date"]
    .astype(str)
    .str.split("-", expand=True)
    .iloc[:, :2]
    .apply(pd.to_numeric, errors="coerce")
    .astype("Int64")  # nullable integer dtype
)

# Delete column Date
df_typed = df_typed.drop(columns=["Date"])

# Create new column with quarter information
df_typed["Quarter"] = ((df_typed["Month"] - 1) // 3 + 1).astype("Int64")

# df_typed.head()  # display the first few rows and check the changes
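
The quarter formula used above, `(month - 1) // 3 + 1`, can be checked on all twelve months. This is a small sketch with illustrative variable names; it also shows that a missing month stays missing with the nullable `Int64` dtype.

```python
import pandas as pd

# Months 1..12 as a nullable integer Series, matching the dtype used above.
months = pd.Series(range(1, 13), dtype="Int64")

# Same formula as in the cell above:
# months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4.
quarters = (months - 1) // 3 + 1

# Cross-check against the quarter reported by pandas' monthly periods.
reference = pd.period_range("2000-01", periods=12, freq="M").quarter

# A missing month propagates as <NA> instead of raising an error.
with_gap = pd.Series([3, pd.NA, 12], dtype="Int64")
quarters_with_gap = (with_gap - 1) // 3 + 1

print(quarters.tolist())
print(quarters_with_gap)
```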

In [37]:
df_typed

Unnamed: 0,Name,Goods types,Measurement unit,Group ID,Product types,Value,Year,Month,Quarter
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013,3,1
1,ŁÓDŹ,,PLN,4,bread - per 1kg,0.00,2018,2,1
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019,12,4
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019,2,1
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002,3,1
...,...,...,...,...,...,...,...,...,...
149935,KUYAVIA-POMERANIA,,PLN,2,pork meat (raw bacon) - per 1kg,12.15,2016,11,4
149936,ŁÓDŹ,"beet sugar white, bagged - per 1kg",PLN,3,,0.00,2012,5,2
149937,LESSER POLAND,,PLN,4,plain mixed bread (wheat-rye) - per 1kg,3.05,2008,6,2
149938,WARMIA-MASURIA,,PLN,2,boneless beef (sirloin) - per 1kg,11.87,2000,11,4
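
The `0.00` entries in the `Value` column of this output (for example rows 1 and 149936) are the missing prices that `fillna(0)` replaced. The sketch below, on a handful of hypothetical prices, illustrates how such zeros shift the mean and the minimum compared with leaving the values missing.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one missing observation.
prices = pd.Series([21.37, np.nan, 3.55, 6.14, 5.63])

# Missing values are skipped by default: the mean uses the four observed prices.
mean_skipping_nan = prices.mean()

# After fillna(0) the missing price counts as a real price of 0.
filled = prices.fillna(0)
mean_with_zeros = filled.mean()

# The minimum (0th percentile) becomes 0 as well.
min_observed = prices.min()
min_filled = filled.min()

print(mean_skipping_nan, mean_with_zeros)
print(min_observed, min_filled)
```

If the goal is to describe actual prices, keeping the missing values as `NaN` may be the safer choice, since pandas excludes them from `mean`, `std`, `quantile` and `describe` by default.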
