# Baron Honey Co. Raw Honey Market Pricing Analysis  

## Table of Contents 
1. [Introduction](#Introduction)
2. [Executive Summary](#Executive-Summary)
3. [Import Data](#import-data)
4. [Clean Data](#clean-data)
5. [Filter Data Set](#Filter-Data-Set)
6. [Initial Exploration](#Initial-Exploration)
7. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
8. [Most Bought Products](#most-bought-products) 
9. [Most Reviewed Products](#products-with-highest-reviews)
10. [Natural Language Processing](#natural-language-processing-analysis)
11. [Recommendations](#recommendations)

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Introduction

- **Business Problem**: Baron Honey Co. is releasing a new raw honey product in the United States and wants to know how to price it competitively in the current market. 
- **Business Solution**: 
    To set the product's price, Baron Honey Co. must know the pricing of competing honey products. To accomplish this, competing products will be grouped into tiers based on how many units they sold last month, and a range of descriptive statistics (average price per ounce, average monthly sales, average rating, etc.) will be calculated for each tier. This process will be repeated for the most reviewed products to find any discernible differences between the two categories. Lastly, Natural Language Processing techniques will be used to find the most frequently used words in the listings of the best-selling products. This will allow Baron Honey Co. to write product listings containing popular search terms, which should improve its page ranking when a customer searches for products. With this information, the executives at Baron Honey Co. will be able to price their new honey product competitively and write an effective product listing. 

- **Dataset Overview**: Only one dataset is used. It was acquired by web scraping e-commerce websites, which provides up-to-date information on the honey market for accurate analysis. 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## Import data 

In [2]:
import re
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px  # used by the scatter plots later in the notebook
from wordcloud import WordCloud
import sys 

sys.path.append('../')

from data_processing.stats import *
from data_processing.database_operations import *
from data_processing.cleaning_pipeline import *

%load_ext autoreload
%autoreload 2

In [3]:
df = extract_scraped_data("clean_data")
df.head()

Unnamed: 0,index,title,brand,weight,price,price_per_ounce,product_rating,bought_last_month,num_reviews,product_description,product_upc,date_acquired
0,1,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,16.0,7.97,0.5,4.7,10000.0,66010.0,"Nate's 100% Pure, Raw & Unfiltered Honey is ...",38778830161,2025-05-15
1,2,"Amazon Grocery, Raw Wildflower Pure Ho...",Amazon Grocery,32.0,11.41,0.36,4.7,10000.0,17839.0,One 2 pound bottle of Raw Wildflower Pure Ho...,842379155444,2025-05-15
2,3,HONEY FEAST Wildflower Honey - 6 Pound...,Honey Feast,96.0,34.89,0.36,4.5,800.0,555.0,Generous Bulk Offering: Step into the world ...,857598008617,2025-05-15
3,4,"Raw, Unfiltered, Unpasteurized Texas H...",Desert Creek,960.0,254.0,0.26,4.6,100.0,440.0,The perfect bulk size! Perfectly deliciou...,853550002266,2025-05-15
4,5,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,32.0,13.69,0.43,4.7,30000.0,66010.0,"Nate's 100% Pure, Raw & Unfiltered Honey is ...",38778830321,2025-05-15


## Clean data 

In [279]:
# df = cleaning_pipeline(df)
# df.head()

In [280]:
# df.to_csv("clean_honey_data.csv")

In [281]:
# insert_clean_data(df, 'clean_data')

## Filter Data Set

Because Baron Honey Co.'s product is raw honey, our analysis only considers products that sellers also list as "raw". 

In [4]:
# Only keep rows that contain mention of Raw Honey in title or description 
filter_title = (
    df['title'].str.contains(r'\bRaw\b', case=False, regex=True, na=False) & 
    df['title'].str.contains(r'\bHoney\b', case=False, regex=True, na=False)
)

filter_desc = (
    df['product_description'].str.contains(r'\bRaw\b', case=False, regex=True, na=False) & 
    df['product_description'].str.contains(r'\bHoney\b', case=False, regex=True, na=False)
)

df = df[filter_title | filter_desc]
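
The word-boundary anchors (`\b`) in the filter matter because "raw" and "honey" also occur inside longer words. The following is a minimal sketch using only the standard `re` module; the helper name `mentions_raw_honey` and the sample titles are made up for illustration and are not part of the scraped data.

```python
import re

# Word-boundary patterns equivalent to the ones used in the filter above.
RAW = re.compile(r"\bRaw\b", re.IGNORECASE)
HONEY = re.compile(r"\bHoney\b", re.IGNORECASE)


def mentions_raw_honey(text):
    """Return True when both 'raw' and 'honey' appear as whole words."""
    if not isinstance(text, str):  # missing values behave like na=False
        return False
    return bool(RAW.search(text)) and bool(HONEY.search(text))


# Made-up titles, for illustration only.
titles = [
    "Raw Unfiltered Wildflower Honey, 16 oz",
    "Strawberry Honey Spread",  # 'raw' is embedded in 'Strawberry'
    "Honeycomb, Raw",           # 'Honey' is embedded in 'Honeycomb'
    None,
]
print([mentions_raw_honey(t) for t in titles])
```

Only the first title passes, which is the behavior the `\b` anchors are there to provide.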

## Initial Exploration

The goal of this section is to understand the characteristics of the honey dataset, such as its data types and distributions.

In [283]:
df.dtypes

index                    int64
title                   object
brand                   object
weight                 float64
price                  float64
price_per_ounce        float64
product_rating         float64
bought_last_month      float64
num_reviews            float64
product_description     object
product_upc             object
date_acquired           object
dtype: object

In [284]:
df.describe()

Unnamed: 0,index,weight,price,price_per_ounce,product_rating,bought_last_month,num_reviews
count,9791.0,9092.0,9565.0,8958.0,9431.0,5660.0,9431.0
mean,6632.898682,29.688677,36.754641,3.689223,4.474298,1325.556537,2642.514368
std,3744.182708,74.846798,35.668462,9.721017,0.307883,3532.459504,8927.355272
min,1.0,1.0,4.0,0.18,2.4,50.0,1.0
25%,3382.5,8.82,15.43,0.78,4.4,100.0,72.0
50%,6649.0,16.0,25.0,1.45,4.6,200.0,332.0
75%,9755.5,24.5,41.0,3.06,4.7,700.0,1060.0
max,13214.0,960.0,298.88,113.95,5.0,30000.0,66202.0


In [285]:
len(df)

9791

In [286]:
df.isna().sum()

index                     0
title                     0
brand                   161
weight                  699
price                   226
price_per_ounce         833
product_rating          360
bought_last_month      4131
num_reviews             360
product_description    2248
product_upc            2916
date_acquired             0
dtype: int64

In [287]:
# How many unique UPCs are in the data (unique() also counts NaN as a value)
len(df['product_upc'].unique())

378
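
Because many rows have no UPC, the count above may include the missing value as one extra "product". The sketch below, on a made-up column, shows the difference between `len(Series.unique())` and `Series.nunique()`; it is an illustration and does not re-run the notebook's data.

```python
import numpy as np
import pandas as pd

# Made-up UPC column with one repeated code and two missing values.
upc = pd.Series(["0001", "0002", "0001", np.nan, np.nan], dtype="object")

with_nan = len(upc.unique())  # NaN is counted as one extra value
without_nan = upc.nunique()   # NaN is ignored by default (dropna=True)
print(with_nan, without_nan)
```

If missing UPCs should not count as a product, `nunique()` is likely the safer choice.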

## Exploratory Data Analysis 

The data will be filtered, analyzed, and visualized to answer the question in the business problem statement. The data will first be filtered to exclude Manuka honey products, which originate in New Zealand and are marketed for their medicinal properties. Including them would skew the price data. Because Baron Honey Co.'s new product is a jar of liquid honey, all single-serving honey products will also be excluded to keep the comparison like for like. 

Afterwards, descriptive statistics such as the average price per ounce (APPO), average sales last month, and average product rating will be computed for each ranking tier (top 5, top 10, and so on). 

In [288]:
# Drop price outliers: keep products under $5 per ounce (rows with a missing price per ounce are dropped as well)
df = df[df['price_per_ounce'] < 5]

### Most Bought Products

In [289]:
# Exclude Manuka honey and ginger honey 
df = df[~df['title'].str.contains("[Mm]anuka")]
df = df[~df['title'].str.contains("[Gg]inger [Hh]oney")]

# Keep only rows with a known sales figure 
most_bought = df[df['bought_last_month'].notna()]

# Exclude all single serving products  
exclude_pattern = r'(?i)\b(?:Straw(?:s)?|Stick(?:s)?|Packet(?:s)?|Single-Serve)\b'

most_bought = most_bought[
  ~(most_bought['title'].str.contains(exclude_pattern, na=False) | 
  most_bought['product_description'].str.contains(exclude_pattern, na=False))
]

# sort the df by the most sold products 
most_bought = most_bought.sort_values(by=['bought_last_month'], ascending=False)

# Only want to see unique products  
most_bought = most_bought.drop_duplicates(subset='product_upc')

most_bought[:10][['title', 'brand', 'price_per_ounce', 'product_rating', 'bought_last_month', 'num_reviews', 'product_upc', 'date_acquired']]

Unnamed: 0,title,brand,price_per_ounce,product_rating,bought_last_month,num_reviews,product_upc,date_acquired
5476,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.45,4.7,30000.0,66105.0,38778830321,2025-05-20
4528,"Nate's Organic 100% Pure, Raw & Unfilt...",Nature Nate's,0.53,4.6,10000.0,32425.0,38778610329,2025-05-19
10149,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.46,4.7,10000.0,66167.0,38778830161,2025-05-22
9869,"Nate's Organic 100% Pure, Raw & Unfilt...",Nature Nate's,0.53,4.6,10000.0,32471.0,38778610169,2025-05-22
5302,"365 by Whole Foods Market, Organic Lig...",365 by Whole Foods Market,0.58,4.7,10000.0,7977.0,99482446123,2025-05-20
10165,"Amazon Grocery, Raw Wildflower Pure Ho...",Amazon Grocery,0.36,4.7,10000.0,17886.0,842379155444,2025-05-23
1749,Local Hive Wildflower Raw Unfiltered H...,Local Hive Honey,0.77,4.7,5000.0,2161.0,75002120247,2025-05-17
672,"Nate's Georgia 100% Pure, Raw & Unfilt...",Nature Nate's,0.37,4.7,3000.0,3572.0,38778890325,2025-05-17
8605,"365 by Whole Foods Market, Organic Raw...",365 By Whole Foods Market,0.42,4.5,3000.0,4061.0,99482446161,2025-05-22
8585,"Fischer's Clover Honey, 12 Oz – 100% P...",Fischer's,0.41,4.7,3000.0,421.0,11137012125,2025-05-22


- As shown in the chart below, the APPO decreases as a product's popularity increases. The most popular products sell for an average of 49 cents per ounce and average 14,000 monthly sales.  

In [290]:
appo_data = VisualizeData(
    most_bought, 
    'average price per ounce', 
    "APPO of Best Selling Products", 
    'Product Ranking',
    'APPO',
    'bar'
)
appo_data.main()

In [291]:
bought_data = VisualizeData(
    most_bought, 
    'average bought last month', 
    "Average Monthly Sales of Top Products", 
    'Product Ranking',
    'Average Monthly Sales',
    'bar'
)
bought_data.main()
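
`VisualizeData` is defined in the project's helper modules and is not shown in this notebook. As a rough cross-check of the per-tier averages, the sketch below recomputes them with plain pandas from the ten rows displayed above. The function `tier_summary` is a hypothetical stand-in, and because several products tie at 10,000 monthly sales, its tiers may not match the charts exactly.

```python
import pandas as pd

# The ten rows displayed above, already sorted by units bought last month.
top10 = pd.DataFrame({
    "price_per_ounce": [0.45, 0.53, 0.46, 0.53, 0.58, 0.36, 0.77, 0.37, 0.42, 0.41],
    "bought_last_month": [30000, 10000, 10000, 10000, 10000,
                          10000, 5000, 3000, 3000, 3000],
})


def tier_summary(frame, sizes=(5, 10)):
    """Average every column over the first n rows, for each tier size n."""
    rows = {f"top {n}": frame.head(n).mean() for n in sizes}
    return pd.DataFrame(rows).T.round(2)


summary = tier_summary(top10)
print(summary)
```

Adding more tier sizes (for example `sizes=(5, 10, 25, 50)`) would require the full `most_bought` frame rather than the ten rows copied here.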

- The best-selling products are also among the highest rated. This matters because the top 5 products have an average of 13,000 more reviews than the top 10, so their ratings rest on a larger sample and carry more weight. 

In [292]:
rating_data = VisualizeData(
    most_bought, 
    'average product rating', 
    "Average Product Rating", 
    'Product Ranking',
    'Average Product Rating',
    'line'
)
rating_data.main()

- The plot below shows that no product reaches the 5,000 monthly sales mark unless it is priced below 80 cents per ounce.  

In [293]:
# Scatter plot to see the relationship between price and sales
fig = px.scatter(
    most_bought, x='price_per_ounce', y='bought_last_month',
    title='Comparison of Price and Monthly Sales',
    hover_data=['price_per_ounce', 'bought_last_month', 'title'],
    labels={'bought_last_month': "Monthly Sales", 'price_per_ounce': 'Price Per Ounce'}
)
fig.show()
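
The price ceiling read off the scatter plot can also be checked numerically. The sketch below uses the price and sales figures of the ten best sellers displayed earlier, which include every unique product at or above 5,000 monthly sales; the constant name `SALES_MARK` is introduced here for illustration.

```python
# (price per ounce, units bought last month) for the ten best sellers shown earlier.
top_sellers = [
    (0.45, 30000), (0.53, 10000), (0.46, 10000), (0.53, 10000), (0.58, 10000),
    (0.36, 10000), (0.77, 5000), (0.37, 3000), (0.42, 3000), (0.41, 3000),
]

SALES_MARK = 5000  # monthly sales threshold discussed in the text

# Highest price per ounce among products at or above the sales mark.
max_price = max(price for price, sold in top_sellers if sold >= SALES_MARK)
print(f"Highest price among products selling {SALES_MARK}+ per month: ${max_price:.2f}/oz")
```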

- The most common jar size among the top 10 products is 16 oz. However, the ten best-selling 32 oz jars average about 2,400 more monthly sales than the ten best-selling 16 oz jars. One plausible interpretation is that customers experiment with different 16 oz honey products and, once they find one that meets their standards, keep buying it in the largest size available.  

In [294]:
# Find which jar size is most common among the top 10 products 
print("Most sold jar size of products in the top 10: ", most_bought['weight'][:10].mode()[0], "oz")

# Mean monthly sales of the ten best-selling 16 oz products 
print("Sales volume of 16oz jars in the top 10: ", most_bought[most_bought['weight'] == 16][:10]['bought_last_month'].mean())

# Mean monthly sales of the ten best-selling 32 oz products 
print("Sales volume of 32oz jars in the top 10: ", most_bought[most_bought['weight'] == 32][:10]['bought_last_month'].mean())

# Average sales last month for the top 10 
print("Last Month's Average Sales of Top 10 Products: ", most_bought['bought_last_month'][:10].mean())

Most sold jar size of products in the top 10:  16.0 oz
Sales volume of 16oz jars in the top 10:  3290.0
Sales volume of 32oz jars in the top 10:  5700.0
Last Month's Average Sales of Top 10 Products:  9400.0
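
Note that the cell above filters by jar size first and then takes ten rows, so each size figure averages the ten best sellers of that size, some of which may rank outside the overall top 10. An alternative reading restricts the comparison to products inside the overall top 10. The sketch below shows that variant on a hypothetical ranking (the numbers are invented for illustration and are not the scraped data).

```python
import pandas as pd

# Hypothetical ranking sorted by sales; NOT the scraped data.
ranked = pd.DataFrame({
    "weight": [32, 16, 16, 32, 16, 8, 16, 32, 16, 16, 16, 32],
    "bought_last_month": [30000, 10000, 10000, 10000, 5000, 3000,
                          3000, 2000, 1000, 1000, 500, 400],
})
top10 = ranked.head(10)

# Most common jar size inside the overall top 10.
most_common_size = top10["weight"].mode()[0]

# Mean sales per jar size, restricted to products inside the overall top 10.
sales_by_size = top10.groupby("weight")["bought_last_month"].mean()
print(most_common_size)
print(sales_by_size)
```

Which of the two definitions is more useful depends on whether the question is about the best sellers overall or the best sellers of each size.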


### Most Reviewed Products 

- The goal of this section is to analyze the most reviewed products using the same metrics as earlier to see if any different insights can be gained from the most reviewed honey products on the market. 

In [295]:
# Keep only rows with a known review count 
highest_reviews = df[df['num_reviews'].notna()]

# Sort the df by review count, with highest shown first 
highest_reviews = highest_reviews.sort_values(by=['num_reviews'], ascending=False)

# Drop any duplicate products 
highest_reviews = highest_reviews.drop_duplicates(subset='product_upc')

# Display the data 
highest_reviews[:10][[
    'title', 'brand', 'price_per_ounce', 'product_rating', 'bought_last_month', 
    'num_reviews', 'product_upc', 'date_acquired'
]]

Unnamed: 0,title,brand,price_per_ounce,product_rating,bought_last_month,num_reviews,product_upc,date_acquired
13055,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.42,4.7,30000.0,66202.0,38778830321.0,2025-05-24
13043,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.45,4.7,10000.0,66202.0,38778830161.0,2025-05-24
9347,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.56,4.7,2000.0,66160.0,38778830130.0,2025-05-22
13119,"Nate's Organic 100% Pure, Raw & Unfilt...",Nature Nate's,0.53,4.6,10000.0,32494.0,38778610169.0,2025-05-24
13062,Nature Nate's 100% Pure USDA Organic R...,Nature Nate's,0.68,4.6,200.0,32494.0,,2025-05-24
13011,"Nate's Organic 100% Pure, Raw & Unfilt...",Nature Nate's,0.48,4.6,10000.0,32494.0,38778610329.0,2025-05-24
10098,Nature Nate's 100% Pure USDA Organic R...,Nature Nate's,0.68,4.6,200.0,32471.0,,2025-05-22
11691,"Amazon Grocery, Raw Wildflower Pure Ho...",Amazon Grocery,0.36,4.7,10000.0,17889.0,842379155444.0,2025-05-24
8581,"365 by Whole Foods Market, Organic Lig...",365 by Whole Foods Market,0.58,4.7,10000.0,7979.0,99482446123.0,2025-05-22
30,Nate's Honey Minis - Single-Serve 100%...,Nature Nate's,0.71,4.6,10000.0,6678.0,38778730201.0,2025-05-15


In [296]:
data = VisualizeData(
    highest_reviews, 
    'average price per ounce', 
    "APPO of Most Reviewed Products", 
    'Product Ranking',
    'APPO',
    'bar'
)
data.main()

In [297]:
data = VisualizeData(
    highest_reviews, 
    'average bought last month', 
    "Average Monthly Sales of Most Reviewed Honey Products", 
    'Product Ranking',
    'Monthly Sales',
    'bar'
)
data.main()

In [298]:
data = VisualizeData(
    highest_reviews, 
    'average product rating', 
    "Average Rating of Most Reviewed Honey Products", 
    'Product Ranking',
    'Average Product Rating',
    'line',
)
data.main()

In [299]:
# Scatter plot to show the relationship between reviews & Price 
fig = px.scatter(
    highest_reviews, y="num_reviews", x="price_per_ounce", 
    title="Comparison of Most Reviewed Products and Price",
    hover_data=['title', 'num_reviews', 'price_per_ounce', 'bought_last_month']
)
fig.show()

- The insights drawn from the most bought honey products also hold for the most reviewed honey products.

### Products mentioning 3rd party testing 

Baron Honey Co. will test its honey for purity and potency. Understanding the pricing of similar tested products will inform the executives' decision when setting the product price.  

In [300]:
# Filter the df for products that mention testing (na=False treats missing text as no match) 
tested_df = df[
    df['title'].str.contains('test', na=False)
    | df['product_description'].str.contains('test', na=False)
]

# Exclude any duplicate products; copy so the assignment below works on an independent frame 
tested_df = tested_df.drop_duplicates(subset='product_upc').copy()

# Recompute the price per ounce from price and weight 
tested_df['price_per_ounce'] = tested_df['price'] / tested_df['weight']

# Display the most important information 
tested_df[:10][[
    'title', 'brand', 'price_per_ounce', 'product_rating', 
    'bought_last_month', 'num_reviews', 'product_upc', 'date_acquired'
]]

Unnamed: 0,title,brand,price_per_ounce,product_rating,bought_last_month,num_reviews,product_upc,date_acquired
0,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.498125,4.7,10000.0,66010.0,38778830161.0,2025-05-15
2,HONEY FEAST Wildflower Honey - 6 Pound...,Honey Feast,0.363438,4.5,800.0,555.0,857598008617.0,2025-05-15
4,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.427812,4.7,30000.0,66010.0,38778830321.0,2025-05-15
12,"Nate's Organic 100% Pure, Raw & Unfilt...",Nature Nate's,0.58875,4.7,10000.0,32383.0,38778610169.0,2025-05-15
17,Zeigler's Local Georgia Award Winning ...,ZEIGLER'S,0.5925,4.5,200.0,240.0,34307332328.0,2025-05-15
27,"Nate's Georgia 100% Pure, Raw & Unfilt...",Nature Nate's,0.374063,4.7,3000.0,3568.0,38778890325.0,2025-05-15
30,Nate's Honey Minis - Single-Serve 100%...,Nature Nate's,0.711224,4.6,10000.0,6678.0,38778730201.0,2025-05-15
35,"Nate's Florida 100% Pure, Raw & Unfilt...",Nature Nate's,0.405,4.7,1000.0,838.0,38778850329.0,2025-05-15
40,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.624375,4.6,600.0,304.0,,2025-05-15
88,"Nate's 100% Pure, Raw & Unfiltered Hon...",Nature Nate's,0.565,4.3,50.0,50.0,38778001097.0,2025-05-15


In [301]:
# How many tested products are there 
print("Number of tested honey products: ",len(tested_df['product_upc'].unique()))

Number of tested honey products:  25
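
The filter for this section matches the lowercase substring `test`, which can miss capitalized mentions and can match unrelated words such as "greatest". One possible refinement, shown here as a sketch on made-up snippets rather than as the method used in this notebook, is a case-insensitive whole-word pattern. The name `TEST_WORD` is hypothetical.

```python
import re

# Whole-word match for test, tests, tested, testing (case-insensitive).
TEST_WORD = re.compile(r"\btest(?:s|ed|ing)?\b", re.IGNORECASE)

# Made-up snippets, for illustration only.
snippets = [
    "Every bottle is tested for purity",
    "Third-party Testing on every batch",
    "The greatest tasting honey",  # contains 'test' inside 'greatest'
]

plain = ["test" in s for s in snippets]                # substring check
whole = [bool(TEST_WORD.search(s)) for s in snippets]  # whole-word check
print(plain)
print(whole)
```

In pandas the same pattern could be passed to `str.contains` with `case=False` and `na=False`, which would likely change the product count reported above.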


In [302]:
# Showing all the brands that test at least one product 
print(tested_df['brand'].unique())

["Nature Nate's" 'Honey Feast' "ZEIGLER'S" "Aunt Sue's" 'Bee Harmony'
 'Beekeeper Direct Honey ESTD 1918' 'Sue Bee' 'Mickelberry Gardens'
 'AKSHAR' 'Banyan Botanicals']


In [303]:
# Number of brands that test their products 
print("Number of brands that test at least one product: ", len(tested_df['brand'].unique()))

Number of brands that test at least one product:  10


In [304]:
data = VisualizeData(
    tested_df, 
    'average price per ounce', 
    "APPO of Tested Honey Products", 
    'Product Ranking',
    'APPO',
    'bar'
)
data.main()

In [305]:
data = VisualizeData(
    tested_df, 
    'average bought last month', 
    "Average Monthly Sales of Tested Honey Products", 
    'Product Ranking',
    'Monthly Sales',
    'bar'
)
data.main()

In [306]:
# Scatter plot to show the relationship between monthly sales & price 
fig = px.scatter(
    tested_df, y="bought_last_month", x="price_per_ounce", 
    title="Comparison of Monthly Sales and Price for Tested Products",
    hover_data=['title', 'bought_last_month', 'price_per_ounce']
)
fig.show()

### Natural Language Processing Analysis

Shoppers rely on search engines to sift through product listings, and those engines match listings against the words a shopper types. Analyzing the words competitors use in their listings will give Baron Honey Co. insight into which keywords to include in its own product listing.

In [307]:
# Instantiate an object to create the word cloud 
wordcloud = WordCloud(
    background_color='white',
    max_words=25,
    height=600,
    width=400
)

# Get the description of top 5 most sold products 
description = most_bought['product_description'][:5].dropna()

# Create a word cloud 
wordcloud.generate(' '.join(description))

# # Save the word cloud as a png 
# wordcloud.to_file('wordcloud.png')

# Save the generated words to a variable 
wc_words = wordcloud.words_

# Round each word's relative frequency to two decimal places 
wc_words = {key: round(value, 2) for key, value in wc_words.items()}
wc_words

{'honey': 1.0,
 'Raw Unfiltered': 0.55,
 'Organic': 0.36,
 'sweetener': 0.32,
 'Nate Pure': 0.27,
 'purity guarantee': 0.27,
 'natural': 0.23,
 'Nate': 0.18,
 'nature': 0.18,
 'ingredient': 0.18,
 'add': 0.18,
 'bottle': 0.18,
 'provide': 0.18,
 'care': 0.18,
 'precision': 0.18,
 'make': 0.18,
 'BEST': 0.18,
 'blend': 0.18,
 'crafted': 0.18,
 'nature intended': 0.18,
 'Every bottle': 0.18,
 'uphold strict': 0.18,
 'strict testing': 0.18,
 'testing standards': 0.18,
 'unmatched level': 0.18}
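
The scores above appear to be relative frequencies scaled so that the most common term is 1.0. A minimal sketch of that idea, using only the standard library on made-up descriptions, is shown below. It counts single words only, whereas the word cloud output also contains two-word phrases, and the stop-word list is a small hypothetical one.

```python
import re
from collections import Counter

# Made-up descriptions, for illustration only.
descriptions = [
    "Raw unfiltered honey, bottled as nature intended.",
    "Pure raw honey. A natural sweetener for tea.",
]
STOPWORDS = {"a", "as", "for", "the", "and"}

# Lowercase, split into words, and drop stop words.
words = [
    w
    for text in descriptions
    for w in re.findall(r"[a-z']+", text.lower())
    if w not in STOPWORDS
]
counts = Counter(words)

# Scale so the most frequent word scores 1.0.
top = counts.most_common(1)[0][1]
relative = {w: round(c / top, 2) for w, c in counts.most_common(5)}
print(relative)
```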

In [308]:
# Get products priced between 50 and 70 cents per ounce, a band around 63 cents PPO 
popular_cluster = most_bought[(most_bought['price_per_ounce'] > 0.50) & (most_bought['price_per_ounce'] < 0.70)]

# Get the description of competitors & drop any null values 
description_cluster = popular_cluster['product_description'][:10].dropna()
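
The cell above hard-codes its price bounds at 0.50 and 0.70. If the band should instead follow from a target price and a tolerance, it can be derived as in the sketch below. The helper `price_band` is hypothetical; a strict 10% tolerance around 63 cents gives a narrower band than the one used above.

```python
def price_band(target, tolerance):
    """Return (low, high) bounds that lie within `tolerance` of `target`."""
    low = round(target * (1 - tolerance), 3)
    high = round(target * (1 + tolerance), 3)
    return low, high


low, high = price_band(0.63, 0.10)
print(low, high)
```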

In [309]:
# Instantiate an object to create the word cloud 
wordcloud_cluster = WordCloud(
    background_color='white',
    max_words=25,
    height=600,
    width=400
)

# Generate a word cloud based on the description 
wordcloud_cluster.generate(' '.join(description_cluster))

# Save the generated words to a variable 
cluster_words = wordcloud_cluster.words_
# Round each word's relative frequency to two decimal places 
cluster_words = {key: round(value, 2) for key, value in cluster_words.items()}
cluster_words

{'honey': 1.0,
 'natural': 0.33,
 'organic': 0.28,
 'raw': 0.28,
 'Raw Unfiltered': 0.19,
 'unfiltered honey': 0.19,
 'Pure': 0.18,
 'flavor': 0.18,
 'bottle': 0.16,
 'sweetener': 0.14,
 'taste': 0.12,
 'Georgia': 0.11,
 'bee': 0.09,
 'gift': 0.09,
 'add': 0.07,
 'best': 0.07,
 'care': 0.07,
 'make': 0.07,
 'Enjoy': 0.07,
 'tea': 0.07,
 'yogurt': 0.07,
 'vitamins': 0.07,
 'enzymes': 0.07,
 'rich': 0.07,
 'bowl': 0.07}

## Recommendations 

- To make Baron Honey Co. competitive with other popular products on the market, the pricing should be: 
    - 8 oz jar: $0.73 per oz 
    - 16 oz jar: $0.63 per oz 
    - 32 oz jar: $0.59 per oz 
- Initial inventory should be split 50% 16 oz jars, 25% 32 oz jars, and 25% 8 oz jars. Once brand reputation is established, the focus should shift to 50% 32 oz jars to accommodate repeat customers. 
- Lower all possible costs, such as shipping, transportation, and storage, to bring the APPO close to 50 cents. This will allow Baron Honey Co. to compete with the top 5 best-selling honey products on price, leaving taste as its advantage.   
- Test Baron Honey's raw honey to understand where it stands on the medicinal rating scale compared to Manuka and other honey products. 
    - If the honey is highly rated it can then be advertised as such and the price per ounce increased substantially. 
- To overcome the high initial shipping costs, a single-serving product should be offered. 
    - This will increase margins and tap into a new market of long-distance athletes, strength athletes, and parents wanting to give their children healthy sweets. 
- Title and description should describe our honey as:
    - "Raw Unfiltered" 
    - "Sweetener" 
    - "Pure"
    - "Best tasting"
    - "As Nature Intended"  
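
For reference, the recommended per-ounce prices translate into shelf prices per jar as follows. This is a small arithmetic sketch that reads the pricing figures above as dollars per ounce (73, 63, and 59 cents); the variable names are illustrative.

```python
# Recommended price per ounce (in dollars), keyed by jar size in ounces.
price_per_ounce = {8: 0.73, 16: 0.63, 32: 0.59}

# Shelf price of each jar = size in ounces x price per ounce.
jar_prices = {oz: round(oz * ppo, 2) for oz, ppo in price_per_ounce.items()}
for oz, price in jar_prices.items():
    print(f"{oz} oz jar: ${price:.2f}")
```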