## Selection of data for analysis

Answer the following questions:

1. What was the average price of a carrot and pea mix per 1 kg in Poland in 2015?
2. What was the average price of apple juice in 2016-2018 in the Masovia province?
3. What was the average price of tomato paste in Lower Silesia Province in 2003-2015? Compared with other products does it seem reasonable to you?<br>

Suggest what can be done with values equal to 0. How does this affect the results of point 3?

> All the product groups needed for this task are in the **product_types** column.

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('../../01_Data/product_prices_renamed.csv', sep=';', decimal='.')

In [3]:
data

Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
0,SUBCARPATHIA,,PLN,2,pork ham cooked - per 1kg,21.37,2013-3
1,ŁÓDŹ,,PLN,4,bread - per 1kg,,2018-2
2,KUYAVIA-POMERANIA,,PLN,2,barley groats sausage - per 1kg,3.55,2019-12
3,LOWER SILESIA,,PLN,2,dressed chickens - per 1kg,6.14,2019-2
4,WARMIA-MASURIA,,PLN,2,Italian head cheese - per 1kg,5.63,2002-3
...,...,...,...,...,...,...,...
149935,KUYAVIA-POMERANIA,,PLN,2,pork meat (raw bacon) - per 1kg,12.15,2016-11
149936,ŁÓDŹ,"beet sugar white, bagged - per 1kg",PLN,3,,0.00,2012-5
149937,LESSER POLAND,,PLN,4,plain mixed bread (wheat-rye) - per 1kg,3.05,2008-6
149938,WARMIA-MASURIA,,PLN,2,boneless beef (sirloin) - per 1kg,11.87,2000-11


In [4]:
# Deleting wrong dates
data = data[~data['date'].str.contains('2099-13')]
data = data[~data['date'].str.contains('1888-0')]
data['date'] = pd.to_datetime(data['date'])

In [5]:
# 1) What was the average price of a carrot and pea mix per 1 kg in Poland in 2015?

In [6]:
data['product_types'].unique()

array([nan, 'whole pickled cucumbers 0.9l - per 1pc.',
       'fresh chichen egges - per 666pcs.',
       '30% tomato concentrate - per 1kg',
       'frozen carrot and pea mix - per 1kg',
       'beet sugar white, bagged - per 1kg',
       'apple juice, boxed - per 1l', 'white table salt bagged - per 1kg',
       'natural chocolate plain - per 1kg'], dtype=object)

In [7]:
data['province'].unique()

array(['SUBCARPATHIA', 'ŁÓDŹ', 'KUYAVIA-POMERANIA', 'LOWER SILESIA',
       'WARMIA-MASURIA', 'HOLY CROSS', 'WEST POMERANIA', 'POLAND',
       'PODLASKIE', 'GREATER POLAND', 'POMERANIA', 'LESSER POLAND',
       'SILESIA', 'MASOVIA', 'LUBLIN', 'LUBUSZ', 'OPOLE'], dtype=object)

In [8]:
data[(data['product_types'] == 'frozen carrot and pea mix - per 1kg')
    & (data['date'].dt.year == 2015)
    & (data['province'] == 'POLAND')
]['value'].mean()

0.8291164658634539

In [9]:
# 2) What was the average price of apple juice in 2016-2018 in the Masovia province?

In [10]:
data[(data['product_types'] == 'apple juice, boxed - per 1l')
    & (data['date'].dt.year >= 2016)
    & (data['date'].dt.year <= 2018)
    & (data['province'] == 'MASOVIA')
]['value'].mean()

2.9172222222222226

In [11]:
# 3) What was the average price of tomato paste in Lower Silesia Province in 2003-2015?
# Compared with other products does it seem reasonable to you?
# Suggest what can be done with values equal to 0. How does this affect the results of point 3?

In [12]:
data[(data['product_types'] == '30% tomato concentrate - per 1kg')
    & (data['date'].dt.year >= 2003)
    & (data['date'].dt.year <= 2015)
    & (data['province'] == 'LOWER SILESIA')
]['value'].mean()

19.435192307692304

In [13]:
data[(data['product_types'] == '30% tomato concentrate - per 1kg')
    & (data['date'].dt.year >= 2003)
    & (data['date'].dt.year <= 2015)
    & (data['province'] == 'LOWER SILESIA')
    & (data['value']) != 0
]['value'].mean()

252.65750000000003

In [14]:
df = pd.read_csv(
    '../../01_Data/product_prices_renamed.csv',
  sep=';',
  encoding='UTF-8',
  decimal='.'
)

In [15]:
print(df.info())
#searching columns
print(f"Product_types:\n{df['product_types'].unique()}\n")
print(f"Provinces:\n{df['province'].unique()}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149940 entries, 0 to 149939
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   province          149940 non-null  object 
 1   product_types     34272 non-null   object 
 2   currency          149940 non-null  object 
 3   product_group_id  149940 non-null  int64  
 4   product_line      115668 non-null  object 
 5   value             137088 non-null  float64
 6   date              149940 non-null  object 
dtypes: float64(1), int64(1), object(5)
memory usage: 8.0+ MB
None
Product_types:
[nan 'whole pickled cucumbers 0.9l - per 1pc.'
 'fresh chichen egges - per 666pcs.' '30% tomato concentrate - per 1kg'
 'frozen carrot and pea mix - per 1kg'
 'beet sugar white, bagged - per 1kg' 'apple juice, boxed - per 1l'
 'white table salt bagged - per 1kg' 'natural chocolate plain - per 1kg']

Provinces:
['SUBCARPATHIA' 'ŁÓDŹ' 'KUYAVIA-POMERANIA' 'LOWER SILESIA'
 'WAR

In [16]:
# What was the average price of a carrot and pea mix per 1 kg in Poland in 2015?
carrot_pea_mix = df.query("product_types == 'frozen carrot and pea mix - per 1kg' and date >= '2015-1' and date <= '2015-12'")
print(f"number o rows: {carrot_pea_mix.shape[0]}")
print(f"average value: {carrot_pea_mix['value'].mean()}")
carrot_pea_mix

number o rows: 68
average value: 0.45967399007795884


Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
2790,WARMIA-MASURIA,frozen carrot and pea mix - per 1kg,EUR,1,,0.607229,2015-12
6670,LOWER SILESIA,frozen carrot and pea mix - per 1kg,EUR,1,,0.585542,2015-12
7073,MASOVIA,frozen carrot and pea mix - per 1kg,EUR,1,,0.937349,2015-12
8542,POLAND,frozen carrot and pea mix - per 1kg,EUR,1,,0.824096,2015-1
10810,GREATER POLAND,frozen carrot and pea mix - per 1kg,EUR,1,,0.787952,2015-1
...,...,...,...,...,...,...,...
141507,LUBUSZ,frozen carrot and pea mix - per 1kg,EUR,1,,0.000000,2015-12
142019,SILESIA,frozen carrot and pea mix - per 1kg,EUR,1,,0.580723,2015-11
142222,LOWER SILESIA,frozen carrot and pea mix - per 1kg,EUR,1,,0.000000,2015-1
147028,LESSER POLAND,frozen carrot and pea mix - per 1kg,EUR,1,,0.845783,2015-1


In [17]:
# What was the average price of apple juice in 2016-2018 in the Masovia province?
df_apple_juice = df.loc[
    (df['province'] == 'MASOVIA') &
    (df['product_types'] == 'apple juice, boxed - per 1l') &
    (df['date'] >= '2016-1') &
    (df['date'] <= '2018-12')
]
print(f"number o rows: {df_apple_juice.shape[0]}")
print(f"average value: {df_apple_juice['value'].mean()}")
df_apple_juice.head()

number o rows: 28
average value: 3.02


Unnamed: 0,province,product_types,currency,product_group_id,product_line,value,date
8147,MASOVIA,"apple juice, boxed - per 1l",PLN,1,,3.06,2018-10
9935,MASOVIA,"apple juice, boxed - per 1l",PLN,1,,3.29,2017-3
11100,MASOVIA,"apple juice, boxed - per 1l",PLN,1,,2.94,2016-5
18729,MASOVIA,"apple juice, boxed - per 1l",PLN,1,,3.26,2017-10
24353,MASOVIA,"apple juice, boxed - per 1l",PLN,1,,2.92,2016-10


In [18]:
# What was the average price of tomato paste in Lower Silesia Province in 2003-2015?
# Compared with other products does it seem reasonable to you?

# with null value
df_tomato_paste = df.loc[
    (df['province'] == 'SILESIA') &
    (df['product_types'] == '30% tomato concentrate - per 1kg') &
    (df['date'] >= '2003-1') &
    (df['date'] <= '2015-12')
]
print(f"number o rows: {df_tomato_paste.shape[0]}")
print(f"average value: {df_tomato_paste['value'].mean()}")

number o rows: 148
average value: 25.028581081081086


In [19]:
# chechking null value
df_tomato_paste['value'].unique()

array([4.96e+00, 4.76e+00, 6.07e+00, 5.84e+00, 5.47e+00, 4.49e+00,
       0.00e+00, 1.59e+00, 6.26e+00, 6.02e+00, 5.42e+00, 6.17e+00,
       6.12e+00, 5.61e+00, 4.33e+00, 6.21e+00, 4.93e+00, 3.05e+00,
       5.62e+00, 4.99e+00, 6.14e+00, 5.98e+00, 6.30e+00, 6.13e+00,
       5.80e+00, 5.02e+00, 4.88e+00, 5.29e+00, 4.26e+00, 5.03e+00,
       5.36e+00, 1.93e+00, 5.22e+00, 2.66e+00, 6.29e+00, 4.98e+00,
       6.04e+00, 6.16e+00, 6.24e+00, 5.96e+00, 6.20e+00, 4.17e+00,
       5.65e+00, 2.47e+00, 5.69e+00, 5.93e+00, 5.78e+00, 5.73e+00,
       5.72e+00, 6.53e+00, 5.66e+00, 4.79e+00, 5.21e+00, 6.03e+00,
       5.89e+00, 4.84e+00, 6.57e+00, 3.35e+00, 5.82e+00, 5.92e+00,
       6.43e+00, 6.18e+00, 5.01e+00, 5.83e+00, 5.59e+00, 5.94e+00,
       3.52e+00, 5.35e+00, 5.95e+00, 6.75e+00, 5.00e+00, 6.11e+00,
       3.24e+00, 5.85e+00, 6.15e+00, 5.74e+00, 1.60e+00, 6.01e+00,
       5.75e+00, 5.86e+00, 5.53e+00, 5.58e+00, 5.57e+00, 5.52e+00,
       5.16e+00, 4.62e+00, 5.60e+00, 6.23e+00, 5.28e+00, 5.15e

In [20]:
# number of null value
len(df_tomato_paste.query("value == 0"))

13

In [21]:
# without null value
df_tomato_paste = df.loc[
    (df['province'] == 'SILESIA') &
    (df['product_types'] == '30% tomato concentrate - per 1kg') &
    (df['date'] >= '2003-1') &
    (df['date'] <= '2015-12') &
    (df['value'] != 0)
]
print(f"number o rows: {df_tomato_paste.shape[0]}")
print(f"average value: {df_tomato_paste['value'].mean()}")

number o rows: 135
average value: 27.438740740740744
