# External Events Table

| Name | Type | Granularity | Primary Key | Volumetry | Business Definition |
|------|------|-------------|-------------|-----------|-------------------|
| external_events | Dimensions | Individual event | None identified (find_primary_key returns an empty list) | 32 rows - 7 columns | List of major events that could have an influence on champagne sales in North America. |

## Imports

In [6]:
# To import data
import pandas as pd 
import numpy as np

# To make the project's 'packages' folder importable
import sys
from pathlib import Path

# Find the folder containing 'packages'
root = next(p for p in Path.cwd().resolve().parents if (p / "packages").exists())
sys.path.insert(0, str(root))

# Functions for data preparation
from packages.utils.data_exploration import (
    find_primary_key, find_volumetry, find_number_of_unique_values_per_column,
    share_of_null, share_of_rows_with_null, share_of_columns_with_null, share_of_null_within_columns
)

In [7]:
# Import the external_events dataset
external_events = pd.read_csv('../../../data/dimensions/external_events.csv')

In [109]:
# Print the first rows
external_events.head()

|   | date | country | region/state | event_name | event_type | impact_score | source |
|---|------|---------|--------------|------------|------------|--------------|--------|
| 0 | 12/31/2023 | USA | | New Years Eve | holiday | 0.95 | public_holidays |
| 1 | 2023-02-14 | USA | | Valentine’s Day | holiday | 0.55 | public_holidays |
| 2 | 2023-07-01 | Canada | ON | | holiday | 0.45 | public_holidays |
| 3 | 2023-10-25 | USA | NY | NYC Wine & Food Fest | festival | 0.3 | event_calendar |
| 4 | 2023-09-05 | Canada | ON | TIFF | festival | 0.4 | event_calendar |


## Analyze the table

### Find the primary key

In [110]:
# Use find_primary_key from the data_exploration module to look for a primary key
find_primary_key(external_events)

[]

The function returns an empty list, so it found no primary key for this table. The granularity of the table is the individual event.
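The implementation of `find_primary_key` is not shown in this notebook, so the check can only be sketched here with plain pandas. The helper name `candidate_keys` and the toy DataFrame are hypothetical. The sketch tests single columns, then pairs of columns, for values that are non-null and unique on every row.

```python
from itertools import combinations

import pandas as pd


def candidate_keys(df: pd.DataFrame, max_size: int = 2) -> list:
    """Return column combinations (up to max_size columns) whose values are
    non-null and unique on every row.

    Hypothetical helper, not the project's find_primary_key. It does not
    filter out supersets of a smaller key.
    """
    keys = []
    for size in range(1, max_size + 1):
        for cols in combinations(df.columns, size):
            subset = df[list(cols)]
            # A key may not contain missing values and may not repeat.
            if subset.notna().all().all() and not subset.duplicated().any():
                keys.append(cols)
    return keys


# Toy data (made up for illustration, not the real external_events file)
toy = pd.DataFrame({
    "date": ["2023-01-01", "2023-01-01", "2023-02-14"],
    "country": ["USA", "Canada", "USA"],
    "event_type": ["holiday", "holiday", "holiday"],
})
print(candidate_keys(toy))
```

On the toy data no single column is unique, but the pair (date, country) is. A check like this on the real table would have to account for missing values, since several columns contain nulls.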

### Volumetry

In [111]:
# Use find_volumetry from the data_exploration module to obtain the number of rows
print(f"There are {find_volumetry(external_events)} rows in the dataset.")

There are 32 rows in the dataset.
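The summary table at the top also reports the number of columns. As a small sketch in plain pandas (the internals of `find_volumetry` are not shown, so this is an assumed equivalent rather than the helper itself), `DataFrame.shape` gives both dimensions at once. The toy DataFrame is made up for illustration.

```python
import pandas as pd

# Toy data standing in for external_events (illustrative values only)
toy = pd.DataFrame({
    "country": ["USA", "Canada", "USA"],
    "event_type": ["holiday", "holiday", "festival"],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
volumetry = f"{n_rows} rows - {n_cols} columns"
print(volumetry)
```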


### Number of unique values per column

In [112]:
find_number_of_unique_values_per_column(external_events)

{2, 4, 5, 9, 24}

In [113]:
# Count the number of unique values for each column in the dataset 'external_events'.
# find_number_of_unique_values_per_column returns a set, which cannot be indexed
# (unique_counts[i] raises "TypeError: 'set' object is not subscriptable") and
# does not keep track of which count belongs to which column.
# Use pandas nunique() instead; note that it ignores missing values by default.
# Sort the results to have the columns with the least number of unique values first.
unique_counts = external_events.nunique().sort_values()

# Print results
for col, count in unique_counts.items():
    print(f"The number of unique values in the column {col} is {count}.")

In [14]:
print(external_events.isna().sum())

date             5
country          0
region/state    16
event_name       3
event_type       0
impact_score     0
source           0
dtype: int64
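
The raw counts are easier to compare as shares. The imported helpers (`share_of_null`, `share_of_rows_with_null`, `share_of_columns_with_null`, `share_of_null_within_columns`) are not shown in this notebook, so the following is only a sketch of the same idea in plain pandas. The function name `null_shares` and the toy data are hypothetical.

```python
import pandas as pd


def null_shares(df: pd.DataFrame) -> dict:
    """Summarise missing values as shares between 0 and 1 (hypothetical helper)."""
    is_null = df.isna()
    return {
        # Share of all cells that are missing
        "cells": float(is_null.to_numpy().mean()),
        # Share of rows with at least one missing value
        "rows": float(is_null.any(axis=1).mean()),
        # Share of columns with at least one missing value
        "columns": float(is_null.any(axis=0).mean()),
        # Share of missing values within each column
        "per_column": is_null.mean().to_dict(),
    }


# Toy data (made up for illustration, not the real external_events file)
toy = pd.DataFrame({
    "date": ["2023-12-31", None, "2023-07-01", "2023-10-25"],
    "country": ["USA", "USA", "Canada", "USA"],
    "event_name": ["New Years Eve", None, None, "TIFF"],
})
shares = null_shares(toy)
print(shares)
```

Applied to the counts above, 16 missing values out of 32 rows would give region/state a per-column share of 0.5.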
