# Importing the required libraries


In [1]:
# Importing the packages
import numpy as np # for mathematical functions
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Importing CSV file

In [2]:
# Reading the data
df = pd.read_csv('/kaggle/input/candy-data/candy-data.csv')

#Show first few rows
df.head()

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.86,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465


# Data Cleaning

In [3]:
# Check for missing (null) values in each column
df.isnull().sum()

competitorname      0
chocolate           0
fruity              0
caramel             0
peanutyalmondy      0
nougat              0
crispedricewafer    0
hard                0
bar                 0
pluribus            0
sugarpercent        0
pricepercent        0
winpercent          0
dtype: int64

In [4]:
# Checking for duplicate values in the dataset
df.duplicated().sum()

0

In [5]:
# Quick summary statistics of numeric columns
df.describe()

Unnamed: 0,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
count,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0,85.0
mean,0.435294,0.447059,0.164706,0.164706,0.082353,0.082353,0.176471,0.247059,0.517647,0.478647,0.468882,50.316764
std,0.498738,0.50014,0.373116,0.373116,0.276533,0.276533,0.383482,0.433861,0.502654,0.282778,0.28574,14.714357
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011,0.011,22.445341
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.22,0.255,39.141056
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.465,0.465,47.829754
75%,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.732,0.651,59.863998
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.988,0.976,84.18029


# Analysing the Dataset using Plotly

In [6]:
# Compute correlation matrix for selected columns
corr_matrix = df[['pricepercent', 'sugarpercent', 'winpercent']].corr()

print("\nCorrelation matrix:")
corr_matrix


Correlation matrix:


Unnamed: 0,pricepercent,sugarpercent,winpercent
pricepercent,1.0,0.329706,0.345325
sugarpercent,0.329706,1.0,0.229151
winpercent,0.345325,0.229151,1.0


1. pricepercent vs. winpercent → +0.345
    * Moderate positive correlation
    * Interpretation: products perceived as higher priced tend to be more popular.
    * This may sound counterintuitive, but it could mean: premium or higher-quality sweets are preferred by consumers.


2. sugarpercent vs. winpercent → +0.229
    * Small positive correlation
    * Interpretation: sweets with higher sugar content are slightly more popular.
    * The effect is smaller than for price, but it’s there: people like sweeter products, but the effect isn’t very strong.


3. pricepercent vs. sugarpercent → +0.330
    * Moderate positive correlation
    * Interpretation: higher-priced products also tend to have higher sugar content.
    * Possibly because richer, sweeter chocolates are also positioned as premium products.
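To make the numbers in this matrix less of a black box, here is a minimal sketch of what `.corr()` computes by default (the Pearson coefficient). It uses only the five rows shown by `df.head()` above, so the value it prints will differ from the full-dataset +0.345. The helper name `pearson_r` is an illustrative assumption, not part of pandas.

```python
import numpy as np
import pandas as pd

# The five rows shown by df.head(); far too few for real conclusions.
sample = pd.DataFrame({
    "pricepercent": [0.860, 0.511, 0.116, 0.511, 0.511],
    "sugarpercent": [0.732, 0.604, 0.011, 0.011, 0.906],
    "winpercent": [66.971725, 67.602936, 32.261086, 46.116505, 52.341465],
})

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson r: co-variation of x and y scaled by their individual spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

manual = pearson_r(sample["pricepercent"], sample["winpercent"])
builtin = sample["pricepercent"].corr(sample["winpercent"])  # pandas default is Pearson
print(f"manual={manual:.6f}  pandas={builtin:.6f}")
```

Both values should agree up to floating-point error, which confirms that the matrix entries are plain Pearson coefficients bounded between -1 and +1.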


In [7]:
# Correlation of all ingredients with winpercent
corr_matrix_contents = df[[ 'chocolate', 'fruity', 'caramel', 'peanutyalmondy',
       'nougat', 'crispedricewafer', 'winpercent']].corr()

print("\nCorrelation matrix of the ingredients:")
corr_matrix_contents


Correlation matrix of the ingredients:


Unnamed: 0,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,winpercent
chocolate,1.0,-0.741721,0.249875,0.377824,0.254892,0.34121,0.636517
fruity,-0.741721,1.0,-0.335485,-0.39928,-0.269367,-0.269367,-0.380938
caramel,0.249875,-0.335485,1.0,0.059356,0.328493,0.213113,0.213416
peanutyalmondy,0.377824,-0.39928,0.059356,1.0,0.213113,-0.017646,0.406192
nougat,0.254892,-0.269367,0.328493,0.213113,1.0,-0.089744,0.199375
crispedricewafer,0.34121,-0.269367,0.213113,-0.017646,-0.089744,1.0,0.32468
winpercent,0.636517,-0.380938,0.213416,0.406192,0.199375,0.32468,1.0


In [8]:
#fig = px.imshow(corr_matrix_contents)
#fig.show()

1. Chocolate
    * Strongest positive correlation with winpercent: +0.64
    * Chocolate products are much more popular.
    * Chocolate is negatively correlated with fruity (-0.74): products tend to be either chocolate or fruity, rarely both.

2. Fruity
    * Negative correlation with winpercent: -0.38
    * Fruity products are less popular overall.
    * Also negatively correlated with caramel, peanut/almond, nougat, crisped rice → confirms products are usually either “chocolate/nutty” or “fruity”, not mixed.

3.  Peanuts/almonds
    * Moderate positive correlation with winpercent: +0.41
    * Nutty candies are liked.
    * Often appear with chocolate (+0.38).

4. Caramel
    * Weak positive correlation with winpercent: +0.21
    * Often combined with nougat (+0.33) and chocolate (+0.25).
    * Adds some popularity boost but less than chocolate or nuts.

5. Nougat
    * Weakest positive correlation with winpercent: +0.20
    * Appears with caramel and chocolate.

6. Crisped rice/wafer
    * Moderate positive correlation with winpercent: +0.32
    * Adds some crunch factor; also appears mostly with chocolate.

**Summary:**
* Popularity is clearly higher for chocolate-based candies, which are often combined with nuts or crisped rice.
* Fruity products, which tend to exclude chocolate and nuts, are generally less popular.
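Because the ingredient columns are 0/1 flags, each correlation with `winpercent` has the same sign as the difference in mean `winpercent` between candies with and without the ingredient. The sketch below illustrates this on the five rows from `df.head()` only; the helper `win_gap` is a hypothetical name introduced for this example.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the columns used here.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "One dime", "One quarter", "Air Heads"],
    "chocolate": [1, 1, 0, 0, 0],
    "fruity": [0, 0, 0, 0, 1],
    "winpercent": [66.971725, 67.602936, 32.261086, 46.116505, 52.341465],
})

def win_gap(df: pd.DataFrame, flag: str) -> float:
    """Mean winpercent of candies with the flag minus the mean of those without."""
    means = df.groupby(flag)["winpercent"].mean()
    return float(means.loc[1] - means.loc[0])

for flag in ["chocolate", "fruity"]:
    gap = win_gap(sample, flag)
    r = sample[flag].corr(sample["winpercent"])
    print(f"{flag}: gap={gap:+.2f} points, r={r:+.3f}")
```

Reporting the gap in percentage points alongside the coefficient can be easier to communicate than the coefficient alone.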

In [9]:
# Correlation of type of candies with winpercent
# (stored under its own name so the ingredient matrix above is not overwritten)
corr_matrix_types = df[['hard', 'bar', 'pluribus', 'winpercent']].corr()

print("\nCorrelation matrix of the type of candies:")
corr_matrix_types


Correlation matrix of the type of candies:


Unnamed: 0,hard,bar,pluribus,winpercent
hard,1.0,-0.265165,0.014532,-0.310382
bar,-0.265165,1.0,-0.593409,0.429929
pluribus,0.014532,-0.593409,1.0,-0.247448
winpercent,-0.310382,0.429929,-0.247448,1.0


**Observations**
1. Hard candies
    * Correlation with winpercent: -0.31
    * Negative correlation: hard candies tend to be less popular.
    * Slight negative correlation with bar (-0.27): products are typically either bar or hard candy, rarely both.
    * Almost no relationship with pluribus (+0.01).

2. Bar
    * Correlation with winpercent: +0.43
    * Positive: bar candies (think chocolate bars, nut bars etc.) are more popular.
    * Strong negative correlation with pluribus (-0.59): candies tend to be sold either as many small pieces in a bag or box (pluribus) or as a single bar, rarely both.
    * Some negative relationship with hard candies (-0.27): product format tends to be either bar or hard candy.

3. Pluribus (multipiece pack)
    * Correlation with winpercent: -0.25
    * Slight negative correlation: pluribus candies (e.g., Skittles, M&Ms) are less popular compared to bars.
    * Strong negative correlation with bar (-0.59): the two formats rarely overlap.

4. winpercent
    * More popular sweets tend to be bars (positive correlation +0.43).
    * Less popular sweets tend to be hard candies (-0.31) or pluribus candies (-0.25).

**Summary:**
* Bar-type candies are more popular among consumers.
* Hard candies and multipiece (pluribus) candies are less popular on average.

In [10]:
# Checking the overall correlation of the contents and type
corr_matrix_overall = df[[ 'chocolate', 'fruity', 'caramel', 'peanutyalmondy',
       'nougat', 'crispedricewafer', 'hard', 'bar', 'pluribus', 'pricepercent', 'sugarpercent', 'winpercent']].corr()

print("\nCorrelation matrix of the overall factors:")
corr_matrix_overall


Correlation matrix of the overall factors:


Unnamed: 0,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,pricepercent,sugarpercent,winpercent
chocolate,1.0,-0.741721,0.249875,0.377824,0.254892,0.34121,-0.344177,0.597421,-0.339675,0.504675,0.104169,0.636517
fruity,-0.741721,1.0,-0.335485,-0.39928,-0.269367,-0.269367,0.390678,-0.515066,0.299725,-0.430969,-0.034393,-0.380938
caramel,0.249875,-0.335485,1.0,0.059356,0.328493,0.213113,-0.122355,0.33396,-0.269585,0.254327,0.221933,0.213416
peanutyalmondy,0.377824,-0.39928,0.059356,1.0,0.213113,-0.017646,-0.205557,0.26042,-0.206109,0.309153,0.087889,0.406192
nougat,0.254892,-0.269367,0.328493,0.213113,1.0,-0.089744,-0.138675,0.522976,-0.310339,0.153196,0.123081,0.199375
crispedricewafer,0.34121,-0.269367,0.213113,-0.017646,-0.089744,1.0,-0.138675,0.423751,-0.224693,0.328265,0.06995,0.32468
hard,-0.344177,0.390678,-0.122355,-0.205557,-0.138675,-0.138675,1.0,-0.265165,0.014532,-0.244365,0.09181,-0.310382
bar,0.597421,-0.515066,0.33396,0.26042,0.522976,0.423751,-0.265165,1.0,-0.593409,0.518407,0.099985,0.429929
pluribus,-0.339675,0.299725,-0.269585,-0.206109,-0.310339,-0.224693,0.014532,-0.593409,1.0,-0.220794,0.045523,-0.247448
pricepercent,0.504675,-0.430969,0.254327,0.309153,0.153196,0.328265,-0.244365,0.518407,-0.220794,1.0,0.329706,0.345325
sugarpercent,0.104169,-0.034393,0.221933,0.087889,0.123081,0.06995,0.09181,0.099985,0.045523,0.329706,1.0,0.229151
winpercent,0.636517,-0.380938,0.213416,0.406192,0.199375,0.32468,-0.310382,0.429929,-0.247448,0.345325,0.229151,1.0


**Conclusion:**
***What drives popularity:***

* Chocolate: highest positive correlation with popularity (winpercent): +0.64 → candies containing chocolate tend to be much more popular.

* Bar format: +0.43 → people like chocolate/nut/caramel bars.

* Peanut/almond content: +0.41 → candies with nuts tend to be more popular.

* Higher price percentile: +0.35 → more expensive sweets tend to be somewhat more popular.

* Crisped rice wafer: +0.32 → adds crunch and goes along with higher popularity.

* More sugar: +0.23 → slight positive relationship.

So chocolate, nuts, bar shape, crunchy textures and premium positioning are all associated with higher popularity.

***What reduces popularity:***

* Fruity: -0.38 → fruit-based candies tend to be less popular.

* Hard candies: -0.31 → less preferred.

* Pluribus (multipiece packs): -0.25 → candies sold as many small pieces tend to be less popular than single-piece candies.

***Business recommendation based on this matrix:***
The most successful sweets are chocolate-based, often with nuts, in a bar format, possibly with crunchy elements (like crisped rice), and positioned as premium (slightly higher price). Avoid new fruity or hard candy products if you want to maximize popularity.
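The ranking in this conclusion was read off the matrix by eye. A short sketch of how it could be produced directly is shown below; it runs on the five `df.head()` rows only, so its ordering is not the full-dataset ordering, and `rank_by_correlation` is a hypothetical helper name.

```python
import pandas as pd

# The five rows shown by df.head(), with a subset of the columns.
sample = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 0],
    "fruity": [0, 0, 0, 0, 1],
    "hard": [0, 0, 0, 0, 0],   # constant in these rows, so its correlation is undefined
    "bar": [1, 1, 0, 0, 0],
    "sugarpercent": [0.732, 0.604, 0.011, 0.011, 0.906],
    "pricepercent": [0.860, 0.511, 0.116, 0.511, 0.511],
    "winpercent": [66.971725, 67.602936, 32.261086, 46.116505, 52.341465],
})

def rank_by_correlation(df: pd.DataFrame, target: str = "winpercent") -> pd.Series:
    """Correlation of every numeric column with the target, strongest positive first."""
    corr = df.select_dtypes("number").corr()[target]
    # Drop the target itself and any undefined (NaN) correlations.
    return corr.drop(target).dropna().sort_values(ascending=False)

ranking = rank_by_correlation(sample)
print(ranking)
```

On the full dataset the same call would be `rank_by_correlation(df)`, which lists drivers at the top and detractors at the bottom in one step.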

# Deeper analysis with multiple plots

In [11]:
fig=px.pie(df,names="chocolate",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing chocolate', title_x=0.5)
fig.show()


The majority of the candies do not contain chocolate. 
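The exact shares behind the pie chart can be computed with `value_counts(normalize=True)`. The sketch below rebuilds the `chocolate` column from the summary statistics above (85 candies with a mean of 0.435294, that is 37 with chocolate and 48 without) rather than loading the CSV.

```python
import pandas as pd

# Reconstructed from df.describe(): 85 rows, mean 0.435294 -> 37 ones, 48 zeros.
chocolate = pd.Series([1] * 37 + [0] * 48, name="chocolate")

# Share of each category in percent, rounded to one decimal place.
shares = chocolate.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

This gives roughly 56.5% without chocolate and 43.5% with chocolate, which matches the statement that the majority of candies contain none.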

In [12]:
fig=px.scatter(df,color='chocolate',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile, win percentage and price percentile with respect to the presence of chocolate in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()

From the above graph, it can be roughly inferred that candies containing chocolate:

* have a sugar percentile ranging from 0.3 to 0.9

* have a win percentage on the higher side of the spectrum

* are mostly candies with a high price percentile

For candies not containing chocolate, the following can be inferred:

* The sugar percentile ranges from 0 to 1

* The win percentage is in the range of 20% to 57%

* Most of the candies in this category have a lower price percentile
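Ranges like the ones listed here are approximate when read off a scatter plot. A possible way to get them exactly is a grouped `min`/`max` aggregation, sketched below on the five `df.head()` rows; `range_by_flag` is a hypothetical helper name, and the same call works for any of the 0/1 columns.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the columns used here.
sample = pd.DataFrame({
    "chocolate": [1, 1, 0, 0, 0],
    "sugarpercent": [0.732, 0.604, 0.011, 0.011, 0.906],
    "pricepercent": [0.860, 0.511, 0.116, 0.511, 0.511],
    "winpercent": [66.971725, 67.602936, 32.261086, 46.116505, 52.341465],
})

def range_by_flag(df, flag, columns=("winpercent", "sugarpercent", "pricepercent")):
    """Minimum and maximum of each column, split by a 0/1 flag column."""
    return df.groupby(flag)[list(columns)].agg(["min", "max"])

ranges = range_by_flag(sample, "chocolate")
print(ranges)
```

With the full dataset, `range_by_flag(df, "chocolate")` would return one row per group and a (column, min/max) pair per measure.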

In [13]:
fig=px.pie(df,names="fruity",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing fruits', title_x=0.5)
fig.show()

In [14]:
fig=px.scatter(df,color='fruity',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile, win percentage and price percentile with respect to the presence of fruit in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()

From the graph, it can be observed that candies containing fruit:

* are less expensive than candies not containing fruit

* have a win percentage ranging from 20% to 57%

* have sugar percentiles mostly between 0.2 and 0.9

For candies without fruit content, the following can be inferred:

* They are more expensive than candies containing fruit

* The sugar percentile ranges from 0.3 to 1

In [15]:
fig=px.pie(df,names="caramel",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing caramel', title_x=0.5)
fig.show()


In [16]:
fig=px.scatter(df,color='caramel',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile with respect to the presence of caramel in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()

From the graph, it can be observed that candies containing caramel:

* are more expensive than those not containing caramel

* have a win percentage roughly ranging from 55% to 78%

* have a sugar percentile ranging from 0.5 to 1

For candies without caramel, the following can be inferred:

* They are less expensive than candies containing caramel

* The win percentage ranges from 20% to 76%

* The sugar percentile varies from 0 to 0.9

In [17]:
fig=px.pie(df,names="peanutyalmondy",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing peanut and almond', title_x=0.5)
fig.show()


In [18]:
fig=px.scatter(df,color='peanutyalmondy',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile with respect to the presence of peanut/almond in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()


From the graph, it can be observed that candies containing peanuts or almonds:

* have a relatively high price percentile, about 0.5 to 0.65

* have a win percentage ranging from 46% to 77%

* have a sugar percentile ranging approximately from 0.3 to 0.8

For candies without peanuts or almonds, the following can be inferred:

* They cover a wide range of price percentiles

* The win percentage ranges from approximately 23% to 75%

* The sugar percentile is spread from 0 to 1

In [19]:
fig=px.pie(df,names="nougat",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing nougat', title_x=0.5)
fig.show()


In [20]:
fig=px.scatter(df,color='nougat',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile with respect to the presence of nougat in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()


From the graph, it can be observed that candies containing nougat:

* have a win percentage ranging from 38% to 76%

* have a narrow sugar percentile range of 0.46 to 0.7

* have a mid-range price percentile of 0.44 to 0.76

For candies not containing nougat, the following points can be inferred:

* The win percentage ranges from 22% to 85%

* The sugar percentile is spread across 0 to 1

* The price percentile ranges from 0.03 to 0.97

In [21]:
fig=px.pie(df,names="crispedricewafer",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='percentage of candies containing crisped rice/wafer/cookie', title_x=0.5)
fig.show()


In [22]:
fig=px.scatter(df,color='crispedricewafer',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile with respect to the presence of crisped rice/wafer/cookie in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()

From the above graph, the following can be inferred about the candies containing crisped rice/wafer/cookie components:

* The price percentile ranges from 0.51 to 0.91

* The sugar percentile ranges from 0.3 to 0.84

* The win percentage spans from 50% to 81%

For candies not containing crisped rice/wafer/cookie components, the following points are observed:

* The price percentile ranges from 0.02 to 0.97

* The sugar percentile ranges from 0 to 1

* The win percentage roughly ranges from 22% to 84%

In [None]:
px.bar(df,x="hard",color_discrete_sequence=[ "steelblue"],title="Comparison of Hard and Soft Candies")


In [None]:
fig=px.scatter(df,color='hard',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile with respect to the hardness/softness in candies",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()


From the above graph, the following points are observed about soft candies:

* The win percentage ranges from 22% to 84%, so the most popular candies are all soft

* The sugar percentile is spread across the full range from 0 to 1

* The price percentile ranges from 0.02 to 0.97, so the most expensive candies are soft

For hard candies, the following points can be inferred:

* The win percentage ranges from 28% to 55%

* The sugar percentile is spread across the full range from 0 to 1

* The price percentile ranges from 0.116 to 0.5, so hard candies top out at a lower price than soft candies

In [None]:
px.bar(df,x="bar",color_discrete_sequence=[ "steelblue"],title="Comparison of Bar and non-Bar Candies")


In [None]:
fig=px.scatter(df,color='bar',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile by considering form of candies(Bar/Non-Bar)",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()


The following points are observed for bar candies:

* They have a higher price percentile, so they tend to be more expensive than non-bar candies

* The sugar percentile ranges from 0.3 to 0.8

* The win percentage ranges roughly from 46% to 77%

For the non-bar candies, the following points can be inferred:

* They have a lower price percentile, so they tend to be cheaper than bar candies

* The sugar percentile is spread across 0 to 1

* The win percentage ranges from 22% to 73%
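Statements such as "more expensive" are easier to back up with quartiles than with the visual impression of bubble sizes. The sketch below compares the price percentile quartiles of bar and non-bar candies using only the five `df.head()` rows; `price_quartiles` is a hypothetical helper name.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the columns used here.
sample = pd.DataFrame({
    "competitorname": ["100 Grand", "3 Musketeers", "One dime", "One quarter", "Air Heads"],
    "bar": [1, 1, 0, 0, 0],
    "pricepercent": [0.860, 0.511, 0.116, 0.511, 0.511],
})

def price_quartiles(df: pd.DataFrame, flag: str, column: str = "pricepercent") -> pd.DataFrame:
    """25th, 50th and 75th percentile of a column, one row per value of the flag."""
    return df.groupby(flag)[column].quantile([0.25, 0.5, 0.75]).unstack()

quartiles = price_quartiles(sample, "bar")
print(quartiles)
```

If the median for one group sits above the upper quartile of the other, the "more expensive" claim holds for typical candies and not just for a few outliers.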

In [None]:
fig = px.pie(df,names="pluribus",color_discrete_sequence=[ "steelblue","darkblue"])
fig.update_layout(title_text='Comparison of Candies available in single/pack form', title_x=0.5)
fig.show()


In [None]:
fig=px.scatter(df,color='pluribus',y='sugarpercent',x='winpercent',size='pricepercent',title="Analysis of sugar percentile,win percentage and price percentile by considering the packing of candies(Single/Pack)",color_discrete_sequence=[ "steelblue","darkblue"])
fig.show()


From the above graph, the following can be inferred about the candies available as a single unit:

* They are more expensive than candies available in a pack

* The win percentage ranges from 27% to 84%

* The sugar percentile stretches from 0 to 1

For the candies available in a pack, the following points are observed:

* They are cheaper than the candies available as a single unit

* The win percentage ranges from 22% to 73%

* The sugar percentile stretches from 0 to 1