First exploratory steps in chocolate prices

We have set up the tables in a SQL database using DBeaver. The main questions we would like to explore are the following:

- Is there cost pressure on the cocoa market? We will explore the prices table.
- What is the market size per capita? We will identify the top chocolate-consuming country in absolute and per-capita terms.
- Is consumption driven by seasonality? We want to explore two possible variables with two sets of countries:
    - United Kingdom and New Zealand: relatively small territories, analysed for the impact of weather.
    - USA, Canada, India, Australia: larger countries, analysed through their national holidays, which give a sizeable share of the population free time on the same days across the whole country.

(Further questions: are the UK and NZ also affected by holidays? Are there differences between the two sets of countries?)


Weather data:

The New Zealand data is a daily average across five stations over the study period. The stations were chosen to be representative of the country's geography and climatic zones.


| Station                   | Station ID | Location / Region            | Climate Representation |
| ------------------------- | ---------- | ---------------------------- | ---------------------- |
| Tauranga Airport          | 1615       | Bay of Plenty / North Island | Subtropical            |
| Wellington, Greta Point   | 41212      | Wellington / North Island    | Windy-Maritime         |
| Westport                  | 41382      | West Coast / South Island    | Very Wet               |
| Christchurch, Kyle Street | 24120      | East Coast / South Island    | Dry                    |
| Queenstown Airport        | 5451       | South Island Interior        | Continental            |






In [40]:
# Imports

import pandas as pd 
import numpy as np


After importing the necessary libraries, we clean and structure the data, starting with the NZ temperature and rainfall. The data was obtained from NIWA as 10 files (one rainfall and one temperature file per station) and needs to be condensed into one table per variable holding only the relevant information.

In [41]:
# Clean data for NZ weather from files

# NIWA station IDs, kept as strings so they can be used in the file names
code_stations = ['1615', '41212', '41382', '24120', '5451']
extracted_rows_r = []
extracted_rows_t = []

# Rainfall: read one daily file per station and keep date, station and reading
for code in code_stations:
    filename = './datasets/nz/' + code + '__Rain__daily.csv'
    table_raw = pd.read_csv(filename)
    table_raw = table_raw.rename(columns={"Observation time UTC": "Date", "Rainfall [mm]": "Rainfall"})
    table_raw['Station ID'] = code
    extracted_rows_r.append(table_raw[['Date', 'Station ID', 'Rainfall']])

# Temperature: same steps for the daily mean temperature files
for code in code_stations:
    filename = './datasets/nz/' + code + '__Temperature__daily.csv'
    table_raw = pd.read_csv(filename)
    table_raw = table_raw.rename(columns={"Observation time UTC": "Date", "Mean Temperature [Deg C]": "Temperature"})
    table_raw['Station ID'] = code
    extracted_rows_t.append(table_raw[['Date', 'Station ID', 'Temperature']])

rain_df = pd.concat(extracted_rows_r, ignore_index=True)
temp_df = pd.concat(extracted_rows_t, ignore_index=True)

# Format dates
rain_df["Date"] = pd.to_datetime(rain_df["Date"]).dt.date
temp_df["Date"] = pd.to_datetime(temp_df["Date"]).dt.date

#print(rain_df['Station ID'].agg(['nunique']))
#print(temp_df['Station ID'].agg(['nunique']))
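The two loading loops above follow the same pattern, so they could be folded into a single helper. The sketch below is a hypothetical refactor, not the notebook's code: `load_station_csv` is a name introduced here, and the in-memory CSV (with made-up values and an assumed timestamp format) stands in for a real NIWA file with the column headers used in this notebook.

```python
import io

import pandas as pd


def load_station_csv(source, station_id, value_column, new_name):
    """Read one station file and return Date, Station ID and one renamed value column."""
    raw = pd.read_csv(source)
    raw = raw.rename(columns={"Observation time UTC": "Date", value_column: new_name})
    raw["Station ID"] = station_id
    raw["Date"] = pd.to_datetime(raw["Date"]).dt.date
    return raw[["Date", "Station ID", new_name]]


# Stand-in for a file such as './datasets/nz/1615__Rain__daily.csv' (values are made up)
fake_file = io.StringIO(
    "Observation time UTC,Rainfall [mm],Other\n"
    "2022-01-01T00:00:00Z,0.0,x\n"
    "2022-01-02T00:00:00Z,1.2,y\n"
)

sample = load_station_csv(fake_file, "1615", "Rainfall [mm]", "Rainfall")
print(sample)
```

With such a helper, the rainfall and temperature tables would each be one `pd.concat` over a list comprehension across `code_stations`.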

In [42]:
# Filter to the study period (both ends inclusive)
start_date = pd.to_datetime("2022-01-01").date()
end_date = pd.to_datetime("2025-12-31").date()

filter_rain = rain_df[(rain_df["Date"] >= start_date) & (rain_df["Date"] <= end_date)]
filter_temp = temp_df[(temp_df["Date"] >= start_date) & (temp_df["Date"] <= end_date)]

#filter_rain['Station ID'].agg(['nunique'])
#filter_rain['Station ID'].unique().tolist()
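As an alternative to comparing Python `date` objects, the same inclusive range filter can be written with `Series.between` on a `datetime64` column. This is a minimal sketch on made-up readings, assuming the `Date` column is kept as `datetime64` rather than converted with `.dt.date`; the `toy` table is illustrative only.

```python
import pandas as pd

# Toy long-format table with made-up readings (not NIWA data)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-31", "2022-01-01", "2025-12-31", "2026-01-01"]),
    "Station ID": ["1615"] * 4,
    "Rainfall": [3.0, 0.0, 1.5, 2.0],
})

# between() includes both endpoints by default
in_range = toy[toy["Date"].between("2022-01-01", "2025-12-31")]
print(in_range)
```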


In [43]:
filter_rain


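|       | Date       | Station ID | Rainfall |
| ----- | ---------- | ---------- | -------- |
| 11472 | 2022-01-01 | 1615       | 0.0      |
| 11473 | 2022-01-02 | 1615       | 0.0      |
| 11474 | 2022-01-03 | 1615       | 0.0      |
| 11475 | 2022-01-04 | 1615       | 0.0      |
| 11476 | 2022-01-05 | 1615       | 0.0      |
| ...   | ...        | ...        | ...      |
| 44261 | 2025-12-27 | 5451       | 2.8      |
| 44262 | 2025-12-28 | 5451       | 0.0      |
| 44263 | 2025-12-29 | 5451       | 0.0      |
| 44264 | 2025-12-30 | 5451       | 0.0      |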


In [44]:
filter_temp

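|       | Date       | Station ID | Temperature |
| ----- | ---------- | ---------- | ----------- |
| 11499 | 2022-01-01 | 1615       | 22.7        |
| 11500 | 2022-01-02 | 1615       | 19.9        |
| 11501 | 2022-01-03 | 1615       | 20.9        |
| 11502 | 2022-01-04 | 1615       | 22.3        |
| 11503 | 2022-01-05 | 1615       | 22.9        |
| ...   | ...        | ...        | ...         |
| 41312 | 2025-12-27 | 5451       | 10.4        |
| 41313 | 2025-12-28 | 5451       | 12.8        |
| 41314 | 2025-12-29 | 5451       | 14.9        |
| 41315 | 2025-12-30 | 5451       | 14.0        |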


In [45]:
filter_rain.to_csv('rainfall.csv')
filter_temp.to_csv('temperature.csv')

In [46]:
# Pivot the tables to wide format: one row per date, one column per station

rain_wide = filter_rain.pivot_table(
    index="Date",
    columns="Station ID",
    values="Rainfall",
    fill_value=np.nan
)

temp_wide = filter_temp.pivot_table(
    index="Date",
    columns="Station ID",
    values="Temperature",
    fill_value=np.nan
)
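To make the reshape concrete, here is a small sketch on made-up values (not NIWA data). It shows that `pivot_table` turns the long table into one row per date and one column per station, and that a station with no reading on a given date ends up as NaN.

```python
import pandas as pd

# Long format: one row per (date, station) reading; values are made up
long_df = pd.DataFrame({
    "Date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    "Station ID": ["1615", "5451", "1615"],
    "Rainfall": [0.0, 2.0, 4.0],
})

# Wide format: dates as the index, station IDs as the columns
wide = long_df.pivot_table(index="Date", columns="Station ID", values="Rainfall")
print(wide)
```

Note that `pivot_table` aggregates duplicate (date, station) pairs with the mean by default, so duplicated rows in the source files would be averaged rather than raising an error.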


In [47]:
rain_wide

| Date / Station ID | 1615 | 24120 | 41212 | 41382 | 5451 |
| ----------------- | ---- | ----- | ----- | ----- | ---- |
| 2022-01-01        | 0.0  | 0.0   | 0.0   | 0.0   | 0.0  |
| 2022-01-02        | 0.0  | 0.0   | 0.0   | 0.0   | 0.0  |
| 2022-01-03        | 0.0  | 0.0   | 0.0   | 0.0   | 0.0  |
| 2022-01-04        | 0.0  | 0.0   | 0.0   | 0.0   | 0.0  |
| 2022-01-05        | 0.0  | 0.0   | 0.0   | 0.0   | 0.0  |
| ...               | ...  | ...   | ...   | ...   | ...  |
| 2025-12-27        | 0.2  | 3.8   | 0.0   | 0.0   | 2.8  |
| 2025-12-28        | 0.8  | 0.0   | 0.0   | 0.0   | 0.0  |
| 2025-12-29        | 26.6 | 0.0   | 0.6   | 0.0   | 0.0  |
| 2025-12-30        | 0.1  | 3.0   | 18.0  | 8.6   | 0.0  |


In [48]:
temp_wide


| Date / Station ID | 1615 | 24120 | 41212 | 41382 | 5451 |
| ----------------- | ---- | ----- | ----- | ----- | ---- |
| 2022-01-01        | 22.7 | 21.6  | 20.4  | 17.8  | 17.9 |
| 2022-01-02        | 19.9 | 19.5  | 24.2  | 19.1  | 21.3 |
| 2022-01-03        | 20.9 | 20.6  | 21.5  | 19.3  | 20.0 |
| 2022-01-04        | 22.3 | 20.9  | 21.3  | 18.6  | 17.8 |
| 2022-01-05        | 22.9 | 19.5  | 17.9  | 19.1  | 17.6 |
| ...               | ...  | ...   | ...   | ...   | ...  |
| 2025-12-27        | 16.6 | 15.4  | 19.3  | 13.7  | 10.4 |
| 2025-12-28        | 19.2 | 15.2  | 19.1  | 16.4  | 12.8 |
| 2025-12-29        | 18.1 | 17.0  | 18.2  | 22.2  | 14.9 |
| 2025-12-30        | 21.9 | 13.5  | 15.7  | 16.7  | 14.0 |


In [49]:
# Average across all five station columns, selected by name.
# (A positional slice such as iloc[:, :-1] would leave out the last station, 5451.)
temp_wide['Daily mean C'] = temp_wide[code_stations].mean(axis=1).round(2)
rain_wide['Daily mean mm'] = rain_wide[code_stations].mean(axis=1).round(2)

print(temp_wide.head())
print(rain_wide.head())

Station ID  1615  24120  41212  41382  5451  Daily mean C
Date                                                     
2022-01-01  22.7   21.6   20.4   17.8  17.9         20.08
2022-01-02  19.9   19.5   24.2   19.1  21.3         20.80
2022-01-03  20.9   20.6   21.5   19.3  20.0         20.46
2022-01-04  22.3   20.9   21.3   18.6  17.8         20.18
2022-01-05  22.9   19.5   17.9   19.1  17.6         19.40
Station ID  1615  24120  41212  41382  5451  Daily mean mm
Date                                                      
2022-01-01   0.0    0.0    0.0    0.0   0.0            0.0
2022-01-02   0.0    0.0    0.0    0.0   0.0            0.0
2022-01-03   0.0    0.0    0.0    0.0   0.0            0.0
2022-01-04   0.0    0.0    0.0    0.0   0.0            0.0
2022-01-05   0.0    0.0    0.0    0.0   0.0            0.0
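The daily mean is meant to cover all five stations. Selecting the station columns by name keeps that true even if the cell is re-run after the mean column has been added, whereas a positional slice such as `iloc[:, :-1]` drops whichever column happens to be last. A small sketch using the first temperature row shown in the table above; `station_cols` and `row` are names introduced here for illustration.

```python
import pandas as pd

station_cols = ["1615", "24120", "41212", "41382", "5451"]

# First temperature row from the wide table (2022-01-01)
row = pd.DataFrame(
    [[22.7, 21.6, 20.4, 17.8, 17.9]],
    columns=station_cols,
    index=["2022-01-01"],
)

by_name = row[station_cols].mean(axis=1).round(2)      # all five stations
by_position = row.iloc[:, :-1].mean(axis=1).round(2)   # leaves out '5451'
print(by_name.iloc[0], by_position.iloc[0])
```

Because station 5451 is the coolest reading on that day, leaving it out pulls the mean upwards.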


In [50]:
# Saving final data to files for upload

temp_wide.to_csv('./datasets/jl_nz_temperatures.csv')
rain_wide.to_csv('./datasets/jl_nz_rainfall.csv')
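
Before uploading, it may be worth checking that every calendar day in the range is present and how many readings each station is missing, since `mean(axis=1)` skips NaN values by default and would then average fewer stations without warning. A minimal sketch on made-up data; the `wide` table here is a stand-in for `temp_wide` or `rain_wide` and assumes a `datetime64` index.

```python
import numpy as np
import pandas as pd

# Toy wide table: one missing reading and one missing calendar day (values are made up)
wide = pd.DataFrame(
    {"1615": [0.0, np.nan, 1.0], "5451": [2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04"]),
)

# Days that should exist in the range but have no row at all
expected_days = pd.date_range("2022-01-01", "2022-01-04", freq="D")
missing_days = expected_days.difference(wide.index)

# Number of NaN readings per station column
missing_per_station = wide.isna().sum()

print(missing_days)
print(missing_per_station)
```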