# Overview
![Chocolate](https://www.eatthis.com/wp-content/uploads//media/images/ext/734795089/various-chocolates.jpg)

**Chocolate is one of the most popular candies** in the world. Each year, residents of the United States collectively eat more than 2.8 billion pounds. However, not all chocolate bars are created equal! This dataset contains expert ratings of over 1,700 individual chocolate bars, along with information on their regional origin, percentage of cocoa, the variety of chocolate bean used and where the beans were grown.

**Flavors of Cacao Rating System:**

- 5 = Elite (transcending beyond the ordinary limits)
- 4 = Premium (superior flavor development, character and style)
- 3 = Satisfactory (3.0) to praiseworthy (3.75) (well made with special qualities)
- 2 = Disappointing (passable but contains at least one significant flaw)
- 1 = Unpleasant (mostly unpalatable)

Each chocolate is evaluated from a combination of both objective qualities and subjective interpretation. A rating here only represents an experience with one bar from one batch. Batch numbers, vintages and review dates are included in the database when known.

The database is narrowly focused on plain dark chocolate with an aim of appreciating the flavors of the cacao when made into chocolate. The ratings do not reflect health benefits, social missions, or organic status.

**Flavor** is the most important component of the Flavors of Cacao ratings. Diversity, balance, intensity and purity of flavors are all considered. It is possible for a straightforward single-note chocolate to rate as high as a complex flavor profile that changes throughout. Genetics, terroir, post-harvest techniques, processing and storage can all be discussed when considering the flavor component.

**Texture** has a great impact on the overall experience, and texture-related issues can also affect flavor. It is a good way to evaluate the maker's vision, attention to detail and level of proficiency.

**Aftermelt** is the experience after the chocolate has melted. Higher quality chocolate will linger and be long lasting and enjoyable. Since the aftermelt is the last impression you get from the chocolate, it receives equal importance in the overall rating.

**Overall Opinion** is really where the ratings reflect a subjective opinion. Ideally it is my evaluation of whether or not the components above worked together and an opinion on the flavor development, character and style. It is also here where each chocolate can usually be summarized by the most prominent impressions that you would remember about each chocolate.

**Acknowledgements:** These ratings were compiled by Brady Brelinski, Founding Member of the Manhattan Chocolate Society. For up-to-date information, as well as additional content (including interviews with craft chocolate makers), please see his website: Flavors of Cacao

**Inspiration:** Where are the best cocoa beans grown? Which countries produce the highest-rated bars? What’s the relationship between cocoa solids percentage and rating?


> **Do Upvote the Notebook if you like!**

# Code

## Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import warnings
warnings.filterwarnings("ignore")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Reading the Data

In [None]:
dataset = pd.read_csv("/kaggle/input/chocolate-bar-ratings/flavors_of_cacao.csv")
original_cols = dataset.columns
new_cols = ['company', 'species', 'REF', 'review_year', 'cocoa_percentage',
                'company_location', 'rating', 'bean_type', 'country']
dataset = dataset.rename(columns=dict(zip(original_cols, new_cols)))
dataset.head()

In [None]:
dataset.info()

## Cleaning the Dataset
Since there is only one null value each in **bean_type** and **country** (originally **Bean Type** and **Broad Bean Origin**), we remove those rows; dropping so few rows won't noticeably affect the dataset. We also convert the cocoa percentage into a numeric value.

In [None]:
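# .copy() gives an independent DataFrame, so the column assignments below
# modify choco itself rather than a view of dataset
choco = dataset.dropna().copy()
choco.info()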

In [None]:
choco['cocoa_percentage'] = choco['cocoa_percentage'].str.replace('%', '').astype(float)
choco.head()
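The cleaning steps above can be tried on a small example first. This is only a sketch: the `toy` frame and its values are made up for illustration and stand in for the real CSV.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "cocoa_percentage": ["70%", "63.5%", "85%"],
    "bean_type": ["Criollo", None, "Trinitario"],
    "country": ["Peru", "Ecuador", "Venezuela"],
})

# Count missing values per column before deciding what to drop.
null_counts = toy.isna().sum()

# Drop incomplete rows; .copy() returns an independent frame so that
# later column assignments do not act on a view of the original.
clean = toy.dropna().copy()

# Strip the trailing percent sign and convert the column to float.
clean["cocoa_percentage"] = clean["cocoa_percentage"].str.rstrip("%").astype(float)

print(null_counts)
print(clean)
```

Checking `isna().sum()` before calling `dropna()` makes it explicit how many rows the cleaning step is about to discard.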

## Feature Engineering
We create a few new features to help us understand the dataset much better.
The few features which we will be engineering are:
- To find whether the chocolate is domestic or not
- To find whether the chocolate is a blend or if it's pure

In [None]:
## Create a feature to check if the chocolate is domestic or not
choco['is_domestic'] = np.where(choco['country'] == choco['company_location'], 1, 0)
choco['is_domestic'].value_counts()

Therefore, **1591** of the chocolates are not domestic; the rest are.
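To see how the domestic flag behaves, here is a minimal sketch on made-up rows. The `toy` frame and the `domestic_share` name are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: where the maker is based vs. where the beans were grown.
toy = pd.DataFrame({
    "company_location": ["Ecuador", "France", "U.S.A.", "Peru"],
    "country": ["Ecuador", "Peru", "Venezuela", "Peru"],
})

# 1 when the beans come from the maker's own country, 0 otherwise.
toy["is_domestic"] = np.where(toy["country"] == toy["company_location"], 1, 0)

# Absolute counts and the share of domestic bars.
counts = toy["is_domestic"].value_counts()
domestic_share = toy["is_domestic"].mean()

print(counts)
print(domestic_share)
```

Note that the comparison is an exact string match, so the flag is only as reliable as the spelling of the two columns.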

In [None]:
choco['is_blend'] = np.where(
    choco['species'].str.lower().str.contains(r',|blend|;')  # bar name lists several origins or says "blend"
    | (choco['country'].str.len() == 1)                      # origin is a single character
    | choco['country'].str.contains(','),                    # origin lists several countries
    1, 0)
choco['is_blend'].value_counts()

Therefore, **1111** of the chocolates are pure; the rest are blends.
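The blend rule combines three checks, which is easier to follow on a few made-up rows. This is a sketch under assumptions: the `toy` values are invented, and the single non-breaking space used as an "empty" origin is a guess at what a one-character origin might look like.

```python
import pandas as pd

# Hypothetical rows; "\xa0" (one non-breaking space) is an assumed
# example of a one-character, effectively empty origin.
toy = pd.DataFrame({
    "species": ["Madagascar", "House Blend", "Peru, Ecuador", "Chuao"],
    "country": ["Madagascar", "\xa0", "Peru, Ecuador", "Venezuela"],
})

name = toy["species"].str.lower()
origin = toy["country"]

toy["is_blend"] = (
    name.str.contains(r",|blend|;")   # name lists several origins or says "blend"
    | (origin.str.len() == 1)         # origin is a single character
    | origin.str.contains(",")        # origin lists several countries
).astype(int)

print(toy)
```

Writing the pattern without a capture group avoids the pandas warning about match groups while matching the same strings.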

## Visualization and Inferences

In [None]:
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.figure_factory as ff
import seaborn as sns

In [None]:
group_labels = ['Cocoa_Percentage']
data= [choco['cocoa_percentage']]
fig = ff.create_distplot(data, group_labels)
fig.show()

Thus, we see that most of the chocolates have about **70%** cocoa.
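The peak of the distribution plot can also be read off numerically. A minimal sketch, assuming a made-up `cocoa` series in place of the real column:

```python
import pandas as pd

# Hypothetical cocoa percentages (values are made up).
cocoa = pd.Series([70.0, 70.0, 72.0, 75.0, 70.0, 64.0, 85.0], name="cocoa_percentage")

# Most frequent value, median, and the share of bars at the most frequent value.
most_common = cocoa.mode().iloc[0]
median = cocoa.median()
share_at_mode = (cocoa == most_common).mean()

print(most_common, median, round(share_at_mode, 2))
```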

In [None]:
rating_counts = choco.rating.value_counts()
plt.figure(figsize=(20,5))
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette="Oranges")
plt.xlabel("Rating value")
plt.ylabel("Counts")
plt.title("Most Common Rating Counts");

Thus, we see that Rating value **3.5** is the most common, conveying that most of the chocolates lie in the range between **Satisfactory and Premium**.
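To connect the numeric ratings to the bands of the rating system, one option is `pd.cut`. This sketch assumes that each band covers the half-open range from its number up to the next one (for example 3.0 up to, but excluding, 4.0); the overview only states this explicitly for the 3.0 to 3.75 band, and the `ratings` values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings (values are made up).
ratings = pd.Series([1.5, 2.75, 3.0, 3.5, 3.75, 4.0, 5.0])

# Assumption: band n covers [n, n + 1); 5.0 and above counts as Elite.
labels = [
    "Unpleasant",
    "Disappointing",
    "Satisfactory to praiseworthy",
    "Premium",
    "Elite",
]
bands = pd.cut(ratings, bins=[1, 2, 3, 4, 5, np.inf], right=False, labels=labels)

print(bands.value_counts().sort_index())
```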

In [None]:
fig = px.pie(choco.head(100),names='country',title='Countries')
fig.show()

The pie chart covers only the first 100 rows of the dataset. Among those rows, the three most common bean origins are **Peru, Venezuela and Ecuador**, in that order.

Interestingly, all three of these countries contain part of the Amazon basin, which suggests that **Amazonian origins are well represented among the bars in this sample**. This reflects how often an origin appears in the ratings, not how much cocoa a region produces.
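Because `choco.head(100)` selects the first 100 rows rather than the most common origins, a ranking over all rows may be more informative. A minimal sketch with `value_counts`, using a made-up `toy` frame:

```python
import pandas as pd

# Hypothetical origins (values are made up).
toy = pd.DataFrame({
    "country": [
        "Peru", "Peru", "Peru", "Peru",
        "Venezuela", "Venezuela", "Venezuela",
        "Ecuador", "Ecuador",
        "Madagascar",
    ]
})

# Count every row, then keep the three most frequent origins.
top_origins = toy["country"].value_counts().head(3)

print(top_origins)
```

The resulting series could be passed to a pie or bar chart in place of the row slice.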

In [None]:
fig = px.box(choco, x="is_domestic", y="rating", color_discrete_sequence=px.colors.sequential.Inferno)
fig.show()

Wow, interesting to know that **domestic chocolates have poorer ratings** than non-domestic chocolates.

Next, we check which is rated better: **blend or pure**?

In [None]:
fig = px.box(choco, x="is_blend", y="rating",color_discrete_sequence=px.colors.sequential.Viridis)
fig.show()

Quite interesting to know that **blend chocolates are rated higher** than pure chocolates.
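A box plot comparison can be backed up with group statistics. This sketch uses made-up values in `toy`; the same pattern would apply to the `is_domestic` flag.

```python
import pandas as pd

# Hypothetical ratings for pure (0) and blend (1) bars (values are made up).
toy = pd.DataFrame({
    "is_blend": [0, 0, 0, 1, 1],
    "rating": [3.0, 3.5, 2.5, 3.5, 3.75],
})

# Group size, mean and median per group.
summary = toy.groupby("is_blend")["rating"].agg(["count", "mean", "median"])

print(summary)
```

Including `count` shows how many bars each group contains, which matters when one group is much smaller than the other.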

In [None]:
flow = pd.crosstab(choco['company_location'],choco['country'])
flow['total'] = flow.sum(axis=1)
flow = flow.sort_values('total', ascending=False)
flow = flow.drop('total', axis=1)

fig, ax = plt.subplots(figsize=[20,5])
sns.heatmap(flow.head(10), cmap='Greens', linewidths=1)
ax.set_title('Goods Flow from Origin to Company Location')

According to the data, we see that:
- The **U.S.A.** is the company location with the most rated bars in this dataset.
- **Ecuador** is the largest origin of cocoa beans in this dataset as well as the biggest domestic manufacturer.
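The table behind the heatmap is a plain cross-tabulation of company location against bean origin. A minimal sketch on made-up rows (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical rows (values are made up).
toy = pd.DataFrame({
    "company_location": ["U.S.A.", "U.S.A.", "Ecuador", "France", "U.S.A."],
    "country": ["Peru", "Ecuador", "Ecuador", "Peru", "Peru"],
})

# Rows: where the maker is based. Columns: where the beans were grown.
flow = pd.crosstab(toy["company_location"], toy["country"])

# Order the rows by total number of bars, largest first.
order = flow.sum(axis=1).sort_values(ascending=False).index
flow = flow.loc[order]

print(flow)
```

A cell where the row and column name match, such as Ecuador/Ecuador, counts domestic bars.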

In [None]:
flow = pd.crosstab(
    choco['company_location'],
    choco['review_year'],
    choco['rating'], aggfunc='mean'
)
flow['total'] = flow.sum(axis=1)
flow = flow.sort_values('total', ascending=False)
flow = flow.drop('total', axis=1)
fig, ax = plt.subplots(figsize=[20,10])
sns.heatmap(flow.head(20), cmap='magma', linewidths=1)
ax.set_title('Goods Flow from Company Location, Rating over the years')

Thus, we see that the mean rating for the **U.S.A.** has always been between **3.0 and 3.5**. **Canada** has always had a good rating.

However, the mean rating for the **U.S.A.** and **Australia** has improved over the years.

In [None]:
flow = flow.T
fig, ax = plt.subplots(figsize=[20,10])
for c in choco['company_location'].value_counts().head(5).index:
    ax.plot(flow.index, flow[c], label=c)
ax.legend(ncol=1, loc=4)
ax.set_title('Timeline of Cocoa Rating by Company location')
plt.show()

Thus, we see that:
- Ratings for the chocolates manufactured in the **U.S.A.** have risen steadily.
- Ratings for the chocolates manufactured in the **U.K.** spiked for a short time and then declined again.
- **Italy** has seen considerable fluctuation in its ratings in the past.
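Trends read off a line chart can be checked with a simple linear fit of the yearly mean rating. This is a sketch under assumptions: the `toy` values are made up, and a straight-line fit is only one rough way to summarize a trend.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews for a single company location (values are made up).
toy = pd.DataFrame({
    "review_year": [2010, 2010, 2011, 2012, 2012, 2013],
    "rating": [3.0, 3.0, 3.25, 3.25, 3.75, 3.75],
})

# Mean rating per review year.
yearly = toy.groupby("review_year")["rating"].mean()

# Fit rating = slope * (years since first review) + intercept.
years_since_start = yearly.index.to_numpy() - yearly.index.min()
slope, intercept = np.polyfit(years_since_start, yearly.to_numpy(), 1)

print(f"Change in mean rating per year: {slope:.3f}")
```

A positive slope indicates rising ratings on average; it does not show whether the rise is steady from year to year.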

In [None]:
blends = pd.crosstab(
    choco['company_location'],
    choco['is_blend'],
    choco['rating'], aggfunc='mean'
)
blends['total'] = blends.max(axis=1)
blends = blends.sort_values('total', ascending=False)
blends = blends.drop('total', axis=1)

fig, ax = plt.subplots(figsize=[20,10])
sns.heatmap(blends.head(25), cmap='viridis', linewidths=2)
ax.set_title('Best Manufacturer by Company Location, Rating Blend/Pureness')


Thus, we see that:
- **Chile** has the highest mean rating by company location.
- Blends from **Poland, Venezuela and Scotland** are rated the worst.
- **Lithuania** and **Bolivia** have blends that are rated higher than their pure bars.
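Mean ratings for locations with very few bars can be misleading, so it may help to require a minimum number of reviews before ranking. In this sketch the `toy` values and the `min_bars` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical reviews (values are made up).
toy = pd.DataFrame({
    "company_location": ["Chile", "U.S.A.", "U.S.A.", "U.S.A.", "France", "France"],
    "rating": [3.75, 3.0, 3.5, 3.25, 3.5, 3.5],
})

# Number of bars and mean rating per company location.
stats = toy.groupby("company_location")["rating"].agg(["count", "mean"])

# Hypothetical threshold: rank only locations with at least this many bars.
min_bars = 2
reliable = stats[stats["count"] >= min_bars].sort_values("mean", ascending=False)

print(reliable)
```

Here the single-bar location drops out of the ranking even though its one rating is the highest.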


## Conclusion
Thus, we derived quite a few interesting conclusions from the dataset, which are listed above.
