# Candy Store Analysis
This notebook briefly analyzes a mix of [The Ultimate Halloween Candy Power Ranking dataset](https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking) and the [Retail Sales dataset](https://www.kaggle.com/datasets/mohammadtalib786/retail-sales-dataset) from Kaggle.

## Data Gathering
### Data set download using Kaggle Python API
The first part of this notebook downloads the data sets using the Kaggle Python API.

The files are downloaded only if they are not already in the local folder or if the local copies are out of date.
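
The download condition described above can be sketched as follows. This is a minimal illustration, not the actual `kaggle_helper` implementation: the function name `files_to_download` and the `remote_sizes` mapping are hypothetical, and it assumes that "out of date" can be approximated by comparing the local file size with the size reported for the remote file.

```python
from pathlib import Path
from typing import Dict, List


def files_to_download(local_folder: str, remote_sizes: Dict[str, int]) -> List[str]:
    """Return the remote file names that are missing locally or differ in size.

    Hypothetical helper: `remote_sizes` maps a file name to the size in bytes
    reported for the remote copy (an assumption, not the Kaggle API itself).
    """
    folder = Path(local_folder)
    stale = []
    for name, remote_size in remote_sizes.items():
        local_file = folder / name
        # Download when the file is absent or its size no longer matches.
        if not local_file.is_file() or local_file.stat().st_size != remote_size:
            stale.append(name)
    return stale
```

For example, `files_to_download("dataset", {"candy-data.csv": 5205})` would return an empty list when a local `candy-data.csv` of 5205 bytes already exists. A size comparison cannot detect a changed file of identical size, so a real helper might compare timestamps or checksums instead.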

In [20]:
import os
import sys
from typing import List

# Handle relative import of modules
src_path = os.path.abspath(os.path.join("../../src"))
if src_path not in sys.path:
    sys.path.append(src_path)

In [21]:
from helpers import kaggle_helper

dataset_folder = "dataset"
kaggle_helper.download_dataset_files(
    dataset_author="mohammadtalib786",
    dataset_name="retail-sales-dataset",
    dataset_folder=dataset_folder,
)
kaggle_helper.download_dataset_files(
    dataset_author="fivethirtyeight",
    dataset_name="the-ultimate-halloween-candy-power-ranking",
    dataset_folder=dataset_folder,
)

Listing local csv files in ./dataset.
File retail_sales_dataset.csv with size 51673 found in ./dataset
File candy-data.csv with size 5205 found in ./dataset
Listing files associated with Kaggle dataset mohammadtalib786/retail-sales-dataset.
File retail_sales_dataset.csv with size 51673 retrieved from Kaggle API.
Listing local csv files in ./dataset.
File retail_sales_dataset.csv with size 51673 found in ./dataset
File candy-data.csv with size 5205 found in ./dataset
Listing files associated with Kaggle dataset fivethirtyeight/the-ultimate-halloween-candy-power-ranking.
File candy-data.csv with size 5205 retrieved from Kaggle API.


### Pandas DataFrame creation
Once the CSV files associated with the Kaggle data sets are downloaded, we can read each of them into a pandas `DataFrame`.

In [22]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df_candies = pd.read_csv(f"{dataset_folder}/candy-data.csv")
print(
    f"The candies data set has {len(df_candies)} candies with {df_candies.shape[1]} variables."
)
df_candies.head()

The candies data set has 85 candies with 13 variables.


Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


In [23]:
df_retail_sales = pd.read_csv(f"{dataset_folder}/retail_sales_dataset.csv")
print(
    f"The retail sales data set has {len(df_retail_sales)} entries with {df_retail_sales.shape[1]} variables."
)
df_retail_sales.head()

The retail sales data set has 1000 entries with 9 variables.


Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,CUST005,Male,30,Beauty,2,50,100


In [40]:
candy_types = [
    "chocolate",
    "fruity",
    "caramel",
    "peanutyalmondy",
    "nougat",
    "crispedricewafer",
    "hard",
    "bar",
    "pluribus",
]
measures_of_interest = ["pricepercent", "winpercent"]
values = []
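for candy_type in candy_types:
    # Each candy type is a 0/1 flag column: group on it and average the measures.
    df_temp = df_candies.groupby(candy_type)[measures_of_interest].mean()
    # Keep only the row for candies that have this attribute (flag == 1).
    type_price, type_win = df_temp.loc[1]
    # winpercent is on a 0-100 scale; divide by 100 to match pricepercent (0-1).
    values.append((candy_type.capitalize(), type_price, type_win / 100))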
df_categories = pd.DataFrame(
    values, columns=["Category", "avgpricepercent", "avgwinpercent"]
)
df_categories.to_csv(f"{dataset_folder}/candy-type-data.csv", index=False)
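
To sanity-check the per-type aggregation, the same idea can be run on a small toy frame whose averages are easy to verify by hand. This is only a sketch: the `toy` data and the `flag_means` helper are made up for illustration, and unlike the cell above it leaves `winpercent` on its original 0-100 scale.

```python
import pandas as pd

# Made-up data: two chocolate candies and two fruity candies.
toy = pd.DataFrame(
    {
        "chocolate": [1, 1, 0, 0],
        "fruity": [0, 0, 1, 1],
        "pricepercent": [0.8, 0.6, 0.2, 0.4],
        "winpercent": [70.0, 60.0, 40.0, 50.0],
    }
)


def flag_means(df, flags, measures):
    """Average each measure over the rows where a 0/1 flag column equals 1."""
    rows = []
    for flag in flags:
        means = df.loc[df[flag] == 1, measures].mean()
        rows.append((flag.capitalize(), *means))
    columns = ["Category", *[f"avg{m}" for m in measures]]
    return pd.DataFrame(rows, columns=columns)


result = flag_means(toy, ["chocolate", "fruity"], ["pricepercent", "winpercent"])
print(result)
```

Filtering on `df[flag] == 1` selects the same rows as grouping on the flag and keeping the group labelled 1, but it avoids relying on the order of the groups.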