# Questions and hypothesis

Here we analyse the three main topics of the project, formulated as three questions:

- What is the market size for each country?
- Does population size affect the market size/consumption and how strong is the correlation?
- Is chocolate consumption affected by bad weather or holidays?

## What is the market size for each country?

To answer this question, we will build a graph from the total consumption per country, measured both by sales amount and by units (boxes) sold.


In [2]:
# Imports

import pandas as pd 
from dotenv import load_dotenv, dotenv_values
import os

from sqlalchemy import create_engine, types
from sqlalchemy import text # to be able to pass string
from sqlalchemy import Integer, String, Float, DateTime, Date


# Loading values from .env
config = dotenv_values()

# Define variables for the login
load_dotenv()
user = os.getenv("DB_USER")
password = os.getenv("DB_PASSWORD")
host = os.getenv("DB_HOST")
port = os.getenv("DB_PORT")
dbname = os.getenv("DB_NAME")
schema = os.getenv("DB_SCHEMA")

# PostgreSQL URL creation
url = f'postgresql://{user}:{password}@{host}:{port}/{dbname}'

# Create engine
engine = create_engine(url)


In [9]:
# Load table of sales
with engine.connect() as conn:
    conn.execute(text(f"SET search_path TO {schema}"))
    sales = pd.read_sql(
        text("SELECT * FROM jl_sales"),
        conn
    )

In [10]:
print(sales.head())
print('=================================================================================================================================================')
print(sales.describe())

     Sales Person    Country              Product        Date   Amount  \
0  Jehu Rudeforth         UK      Mint Chip Choco  2022-01-04   5320.0   
1     Van Tuxwell      India        85% Dark Bars  2022-08-01   7896.0   
2    Gigi Bohling      India  Peanut Butter Cubes  2022-07-07   4501.0   
3    Jan Morforth  Australia  Peanut Butter Cubes  2022-04-27  12726.0   
4  Jehu Rudeforth         UK  Peanut Butter Cubes  2022-02-24  13685.0   

   Boxes Shipped  Year  Month  
0            180  2022      1  
1             94  2022      8  
2             91  2022      7  
3            342  2022      4  
4            184  2022      2  
             Amount  Boxes Shipped         Year        Month
count   3282.000000    3282.000000  3282.000000  3282.000000
mean    6030.338775     164.666971  2023.000000     4.576782
std     4393.980200     124.024736     0.816621     2.315759
min        7.000000       1.000000  2022.000000     1.000000
25%     2521.495000      71.000000  2022.000000     3.0000

In [16]:
sales.groupby("Country")["Amount"].sum().sort_values()


Country
New Zealand    3043654.04
Canada         3078495.65
USA            3313858.09
India          3343730.83
UK             3365388.90
Australia      3646444.35
Name: Amount, dtype: float64

In [17]:
sales.groupby("Country")["Boxes Shipped"].sum().sort_values()

Country
New Zealand    81350
USA            81820
India          89968
UK             92523
Canada         95158
Australia      99618
Name: Boxes Shipped, dtype: int64
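
The two totals above can also be drawn as the graph announced at the start of this section. The following is only a minimal sketch: it assumes `matplotlib` is installed, copies the totals from the outputs above into a small frame named `totals`, and uses a hypothetical helper `plot_market_size` that is not part of the original notebook. Inside the notebook, the same frame could be obtained with `sales.groupby("Country")[["Amount", "Boxes Shipped"]].sum()`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Totals copied from the two groupby outputs above.
# In the notebook this could instead be:
#   totals = sales.groupby("Country")[["Amount", "Boxes Shipped"]].sum()
totals = pd.DataFrame(
    {
        "Amount": [3043654.04, 3078495.65, 3313858.09,
                   3343730.83, 3365388.90, 3646444.35],
        "Boxes Shipped": [81350, 95158, 81820, 89968, 92523, 99618],
    },
    index=pd.Index(
        ["New Zealand", "Canada", "USA", "India", "UK", "Australia"],
        name="Country",
    ),
)


def plot_market_size(totals):
    """Draw one horizontal bar chart per measure, sorted ascending."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, column in zip(axes, ["Amount", "Boxes Shipped"]):
        totals[column].sort_values().plot.barh(ax=ax)
        ax.set_title(f"Total {column} per country")
        ax.set_xlabel(column)
    fig.tight_layout()
    return fig, axes


fig, axes = plot_market_size(totals)
```

Sorting each panel separately makes it easy to see where a country changes position between the two measures.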

Australia clearly ranks first in both sales amount and boxes shipped, whereas New Zealand ranks last in both. The remaining countries change position between the two measures (Canada, for example, is second-lowest in sales amount but second-highest in boxes shipped), so at this first pass they show no clear ordering within the global market.
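
To make the comparison of positions explicit, both totals can be ranked side by side. This is a sketch under assumptions: the totals are copied from the outputs above, rank 1 is taken to mean the largest value, and the column `Amount per box` is an illustrative ratio of the two measures that does not appear in the original analysis.

```python
import pandas as pd

# Totals copied from the two groupby outputs above.
totals = pd.DataFrame(
    {
        "Amount": [3043654.04, 3078495.65, 3313858.09,
                   3343730.83, 3365388.90, 3646444.35],
        "Boxes Shipped": [81350, 95158, 81820, 89968, 92523, 99618],
    },
    index=pd.Index(
        ["New Zealand", "Canada", "USA", "India", "UK", "Australia"],
        name="Country",
    ),
)

# Rank each measure separately; rank 1 = largest total (no ties in this data).
ranks = totals.rank(ascending=False).astype(int).add_suffix(" rank")
summary = totals.join(ranks)

# Illustrative ratio (assumption): total amount divided by boxes shipped.
summary["Amount per box"] = (summary["Amount"] / summary["Boxes Shipped"]).round(2)

print(summary.sort_values("Amount rank"))
```

Countries whose two ranks differ are the ones where revenue and volume tell different stories.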

## Does population size affect the market size/consumption and how strong is the correlation?

We now add population size to the analysis by loading it from the database.

In [18]:
# Load table of population
with engine.connect() as conn:
    conn.execute(text(f"SET search_path TO {schema}"))
    population = pd.read_sql(
        text("SELECT * FROM jl_population_2022_2025"),
        conn
    )

ProgrammingError: (psycopg2.errors.InsufficientPrivilege) permission denied for table jl_population_2022_2025

[SQL: SELECT * FROM jl_population_2022_2025]
(Background on this error at: https://sqlalche.me/e/20/f405)

population
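
Once read access to `jl_population_2022_2025` is available, the strength of the relationship could be measured roughly as follows. This is a sketch, not the notebook's method: the helper `population_correlation` is hypothetical, the population table is assumed to reduce to one row per country with columns `Country` and `Population` (its real layout is unknown here), and the demo frames contain made-up placeholder numbers purely to show the call. Spearman's coefficient is computed as the Pearson correlation of the ranks, which avoids an extra dependency.

```python
import pandas as pd


def population_correlation(totals, population, metric="Amount"):
    """Correlate a per-country total with population size.

    Assumes both frames have a "Country" column and that `population`
    has a "Population" column (assumed names, not confirmed by the source).
    """
    merged = totals.merge(population, on="Country", how="inner")
    pearson = merged[metric].corr(merged["Population"])
    # Spearman = Pearson correlation of the ranked values.
    spearman = merged[metric].rank().corr(merged["Population"].rank())
    return {"n": len(merged), "pearson": pearson, "spearman": spearman}


# Made-up placeholder data, for demonstration only (not real figures).
demo_totals = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Amount": [10.0, 20.0, 30.0, 40.0]}
)
demo_population = pd.DataFrame(
    {"Country": ["A", "B", "C", "D"], "Population": [1, 2, 3, 4]}
)

result = population_correlation(demo_totals, demo_population)
print(result)
```

With only six countries in the sales data, any coefficient obtained this way rests on a very small sample and should be read with caution.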