# Introduction

The goal of this project is to analyse chocolate sales data from a dataset uploaded to Kaggle (https://www.kaggle.com/datasets/atharvasoundankar/chocolate-sales/data). I love chocolate, and after Breakaway was discontinued and the Yorkie biscuit met the same fate, I want to see what the next victim could be (I am not talking about myself).

<h2> Collection Methodology</h2>
Data was aggregated from chocolate retailers and online marketplaces.
Only confirmed transactions were included to ensure accuracy.
Revenue values reflect final prices after applying discounts, if any.


This project will scope, analyse, prepare, plot data, and seek to explain the findings from the analysis.

Here are a few questions that this project has sought to answer:

- What purchasing patterns are evident across different customer segments?
- Do certain types of chocolate categories perform best in different locations?
- What potential marketing strategies can be employed based on data insights, and how valuable is chocolate?

**Data sources:**
 `ChocolateSales.csv` 


## Scoping

It's beneficial to create a project scope. Four sections were created below to help guide the project's process and progress. The first section is the project goals, which defines the high-level objectives and sets the intentions for this project. The next section is the data, where it needs to be checked whether the project goals can be met with the available data. Thirdly, the analysis will have to be thought through, including the methods and questions that are aligned with the project goals. Lastly, evaluation will help us build conclusions and findings from our analysis.

### Project Goals

In this project the perspective will be that of a chocolate lover who wants to gain an insight into chocolate sales across the globe, trends in the types of chocolate being purchased, and how valuable the sweet treat is.

- What purchasing patterns are evident across different customer segments?
- Do certain types of chocolate categories perform best in different locations?
- What potential marketing strategies can be employed based on data insights, and how valuable is chocolate?

### Data

This project has one data set as a `csv` file. This dataset contains detailed records of chocolate sales, including product details, sales quantities, revenue, and customer segments. 

### Analysis

In this section, descriptive statistics and data visualisation techniques will be employed to understand the data better. Statistical inference will also be used to test if the observed values are statistically significant. Some of the key metrics that will be computed include: 

1. Distributions
1. Counts
1. Relationship between location and sales
1. Customer behaviour analysis
1. Seasonality trends 
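
The statistical inference step could take several forms. As a minimal sketch, assuming a permutation test on the difference in mean `Amount` between two countries is an acceptable choice (this section does not commit to a specific test), the mechanics look like this. The function name `permutation_p_value`, the resample count and the seed are illustrative assumptions. The two tiny samples are the UK and India amounts from the first five rows of the dataset, so the resulting p-value only demonstrates the procedure and says nothing about the full data.

```python
import numpy as np


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])

    hits = 0
    for _ in range(n_resamples):
        # Shuffle the pooled values and split them back into two groups
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[: a.size].mean() - shuffled[a.size:].mean())
        hits += int(diff >= observed)

    # Add-one correction keeps the estimate strictly above zero
    return (hits + 1) / (n_resamples + 1)


# Amounts from the first five rows of ChocolateSales.csv
uk_amounts = [5320, 13685]
india_amounts = [7896, 4501]

p_value = permutation_p_value(uk_amounts, india_amounts)
print(f"p-value: {p_value:.3f}")
```

With only two orders per country the test cannot detect anything; on the full dataset the same function would be called with every `Amount` value for each country.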

### Evaluation

Lastly, it's a good idea to revisit the goals and check if the output of the analysis corresponds to the questions first set to be answered (in the goals section). This section will also reflect on what has been learned through the process, and whether any of the questions could not be answered. It will also cover limitations and whether any of the analysis could have been done using different methodologies.

## Import Python Modules

In [231]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

## Loading in the data

In [232]:
salesData = pd.read_csv("/Users/jai/Documents/GitHub/EDARepo/ChocolateEDA/ChocolateSales.csv")
salesData.head()

Unnamed: 0,Sales Person,Country,Product,Date,Amount,Boxes Shipped
0,Jehu Rudeforth,UK,Mint Chip Choco,04-Jan-22,"$5,320",180
1,Van Tuxwell,India,85% Dark Bars,01-Aug-22,"$7,896",94
2,Gigi Bohling,India,Peanut Butter Cubes,07-Jul-22,"$4,501",91
3,Jan Morforth,Australia,Peanut Butter Cubes,27-Apr-22,"$12,726",342
4,Jehu Rudeforth,UK,Peanut Butter Cubes,24-Feb-22,"$13,685",184


The <code>ChocolateSales.csv</code> file contains information on chocolate sales across the globe. The columns in the dataset are:
<ul>
<li>Sales Person - the sales person who sold the product</li> 
<li>Country - Sales region or store location where the transaction took place</li>
<li>Product - Name of the chocolate product sold</li>
<li>Date - The transaction date of the chocolate sale</li>
<li>Amount - Total revenue generated from the sale</li>
<li>Boxes Shipped - Number of chocolate units sold in the transaction</li>
</ul>

<h2>Data characteristics</h2>

In [233]:
salesData.shape

(1094, 6)

In [234]:
salesData.dtypes

Sales Person     object
Country          object
Product          object
Date             object
Amount           object
Boxes Shipped     int64
dtype: object

In [235]:
salesData.isna().sum()

Sales Person     0
Country          0
Product          0
Date             0
Amount           0
Boxes Shipped    0
dtype: int64

We have no missing data. First and foremost, Amount needs to be converted to a float to enable descriptive analysis of the column later down the line.

In [243]:
# Strip the thousands separator and the dollar sign, then cast to float
salesData["Amount"] = salesData["Amount"].astype(str)
salesData["Amount"] = salesData["Amount"].str.replace(",", "", regex=False)
salesData["Amount"] = salesData["Amount"].str.replace("$", "", regex=False).astype(float)
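
As an alternative sketch, the same cleaning can be done in a single pass with one regular expression that removes both the dollar sign and the comma. The `raw_amount` series below is an illustrative stand-in built from the five `Amount` values shown in the `head()` output, and it assumes every value in the column follows the same `$1,234` pattern.

```python
import pandas as pd

# Stand-in for salesData["Amount"], using the values from the head() output
raw_amount = pd.Series(
    ["$5,320", "$7,896", "$4,501", "$12,726", "$13,685"], name="Amount"
)

# Remove "$" and "," in one step, then convert the remaining digits to float
clean_amount = raw_amount.str.replace(r"[$,]", "", regex=True).astype(float)
print(clean_amount)
```

A value that does not match the assumed pattern (for example a different currency symbol) would make `astype(float)` raise a `ValueError`, which is a useful early warning rather than a silent error.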

In [237]:
salesData.dtypes

Sales Person      object
Country           object
Product           object
Date              object
Amount           float64
Boxes Shipped      int64
dtype: object
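
The output above shows that `Date` is still stored as `object`. For the seasonality analysis it will probably need to be a real datetime. A minimal sketch of one way to do that is below; the format string `%d-%b-%y` is inferred from the first five rows (for example `04-Jan-22`) and is an assumption about the rest of the column, and the `sample` frame is a stand-in built from those same rows. Parsing `%b` relies on English month abbreviations being recognised in the running environment.

```python
import pandas as pd

# Stand-in for salesData, using the Date and Amount values from head()
sample = pd.DataFrame(
    {
        "Date": ["04-Jan-22", "01-Aug-22", "07-Jul-22", "27-Apr-22", "24-Feb-22"],
        "Amount": [5320.0, 7896.0, 4501.0, 12726.0, 13685.0],
    }
)

# Parse day-month-year strings into datetime64 values
sample["Date"] = pd.to_datetime(sample["Date"], format="%d-%b-%y")

# Total revenue per calendar month, a starting point for seasonality checks
monthly_revenue = sample.groupby(sample["Date"].dt.to_period("M"))["Amount"].sum()
print(monthly_revenue)
```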

<h2>Explore the data</h2>

Find the number of distinct entries for each column in the data. The dataset covers 22 products sold by 25 sales people across 6 countries, and <code>Boxes Shipped</code> takes 385 distinct values.

In [238]:
print(f"Number of distinct entries in each column:{salesData.nunique()}")

Number of distinct entries in each column:Sales Person      25
Country            6
Product           22
Date             168
Amount           827
Boxes Shipped    385
dtype: int64


Taking a look into the types of <code>Product</code> sold, the most popular products by number of orders are the 50% Dark Bites and the Eclairs (60 orders each). The least popular product is the Choco Coated Almonds with 39 orders (perfect as I am allergic anyway).

In [244]:
salesData.groupby("Product").size().sort_values()

Product
Choco Coated Almonds    39
Baker's Choco Chips     41
70% Dark Bites          42
Caramel Stuffed Bars    43
Mint Chip Choco         45
Manuka Honey Choco      45
Orange Choco            47
Almond Choco            48
Raspberry Choco         48
Milk Bars               49
Peanut Butter Cubes     49
99% Dark & Pure         49
Fruit & Nut Bars        50
After Nines             50
85% Dark Bars           50
Organic Choco Syrup     52
Spicy Special Slims     54
Drinking Coco           56
White Choc              58
Smooth Sliky Salty      59
50% Dark Bites          60
Eclairs                 60
dtype: int64
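
The counts above rank products by number of orders only. A product could have few orders but large ones, so it may be worth comparing orders, boxes shipped and revenue side by side. The sketch below shows one way to do that with named aggregation; the `sample` frame is a stand-in built from the five rows in the `head()` output, and the output column names `orders`, `boxes` and `revenue` are illustrative choices.

```python
import pandas as pd

# Stand-in for salesData, using the rows from the head() output
sample = pd.DataFrame(
    {
        "Product": [
            "Mint Chip Choco",
            "85% Dark Bars",
            "Peanut Butter Cubes",
            "Peanut Butter Cubes",
            "Peanut Butter Cubes",
        ],
        "Amount": [5320.0, 7896.0, 4501.0, 12726.0, 13685.0],
        "Boxes Shipped": [180, 94, 91, 342, 184],
    }
)

# One row per product: number of orders, total boxes and total revenue
product_summary = (
    sample.groupby("Product")
    .agg(
        orders=("Amount", "count"),
        boxes=("Boxes Shipped", "sum"),
        revenue=("Amount", "sum"),
    )
    .sort_values("revenue", ascending=False)
)
print(product_summary)
```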

Next I want to see which country made the most purchases. <code>Country</code> has 6 categories, and the countries included in the dataset are:
<ul>
<li>Australia</li>
<li>Canada</li>
<li>India</li>
<li>New Zealand</li>
<li>UK</li>
<li>USA</li>
</ul>

Australia (205) placed the most orders for chocolate and New Zealand the least (173).

In [240]:
salesData.groupby("Country").size().sort_values()

Country
New Zealand    173
Canada         175
UK             178
USA            179
India          184
Australia      205
dtype: int64
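
To start on the question of whether certain products perform best in different locations, order counts per country can be extended to a product-by-country revenue table. This is a rough sketch using `pd.pivot_table` on a stand-in `sample` frame built from the five `head()` rows; on the real data the same call would be applied to `salesData` once `Amount` is numeric.

```python
import pandas as pd

# Stand-in for salesData, using the rows from the head() output
sample = pd.DataFrame(
    {
        "Country": ["UK", "India", "India", "Australia", "UK"],
        "Product": [
            "Mint Chip Choco",
            "85% Dark Bars",
            "Peanut Butter Cubes",
            "Peanut Butter Cubes",
            "Peanut Butter Cubes",
        ],
        "Amount": [5320.0, 7896.0, 4501.0, 12726.0, 13685.0],
    }
)

# Rows are products, columns are countries, cells are total revenue
revenue_table = pd.pivot_table(
    sample,
    index="Product",
    columns="Country",
    values="Amount",
    aggfunc="sum",
    fill_value=0,
)

# Highest-revenue product in each country
best_per_country = revenue_table.idxmax()
print(revenue_table)
print(best_per_country)
```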

Taking a look at the <code>Amount</code> of each order, we can see the mean value of an order is $5652.31 with a large standard deviation of $4102.44, suggesting there is much variation in the order amount. Orders range from a minimum of just $7.00 to a maximum of $22050, and the mean is likely pulled upwards by the largest orders. The median value is $4868.50, which is lower than the mean and should be taken as a more robust central measure given this potential right skew.

In [241]:
salesData["Amount"].describe()

count     1094.000000
mean      5652.308044
std       4102.442014
min          7.000000
25%       2390.500000
50%       4868.500000
75%       8027.250000
max      22050.000000
Name: Amount, dtype: float64
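
The skew can be checked roughly from the summary statistics alone. The sketch below plugs the values from the `describe()` output into two common rules of thumb: Tukey's 1.5 × IQR fences for outliers and the quartile (Bowley) skewness coefficient. The choice of these two measures is mine and is only one way of quantifying the skew.

```python
# Values copied from the describe() output above
mean_amount = 5652.308044
q1, median, q3 = 2390.50, 4868.50, 8027.25
min_amount, max_amount = 7.00, 22050.00

# Interquartile range and Tukey's fences (1.5 * IQR beyond the quartiles)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Quartile skewness: positive means the upper half is more spread out
quartile_skew = ((q3 - median) - (median - q1)) / iqr

print(f"IQR: {iqr:.2f}")
print(f"Fences: {lower_fence:.2f} to {upper_fence:.2f}")
print(f"Max above upper fence: {max_amount > upper_fence}")
print(f"Min below lower fence: {min_amount < lower_fence}")
print(f"Mean minus median: {mean_amount - median:.2f}")
print(f"Quartile skewness: {quartile_skew:.3f}")
```

On these numbers the maximum order lies above the upper fence while the minimum does not fall below the lower one, and the quartile skewness is slightly positive, which is consistent with a mild right skew driven by a few large orders.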