In [None]:
import warnings
import numpy as np
warnings.filterwarnings('ignore')  # np.warnings is unavailable in recent NumPy releases

import matplotlib as mpl
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'
import seaborn as sns

import pandas as pd
import re

In [None]:
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Scotch Whisky Reviews - Basic Exploratory Data Analysis 🥃🥃

<img src = "https://imgur.com/DAbTcgV.jpg" width="500">

<center>Original photo by Josh Applegate (Unsplash)</center>

**Table of Contents**

1. [Introduction](#Introduction)
2. [Dataset](#Dataset)
3. [Loading and Processing Data](#Loading-and-Processing-Data)
4. [Exploratory Data Analysis](#Exploratory-Data-Analysis)
5. [Trivia](#Trivia)
6. [Conclusions](#Conclusions)

## Introduction

Whisky (or whiskey) is a type of distilled alcoholic beverage made from fermented grain mash; a malt whisky uses just three natural ingredients - barley, water and yeast. Whisky is distilled throughout the world, most popularly in Scotland, Ireland, the United States, and Japan. It comes in many styles and some countries have strict regulations regarding its production. For instance, a whisky can be classified as 'Scotch' if it is distilled and matured in Scotland for at least three years and bottled at a minimum alcoholic strength of 40% ABV ([Source](https://www.scotch-whisky.org.uk/discover/enjoying-scotch/scotch-whisky-categories/)).

Whether it's Scotch, Irish, Japanese or bourbon, whisky is one of the most popular spirits in the world and can be enjoyed on its own or used in cocktails. You can read more [here](https://www.thespruceeats.com/history-of-whisky-1807685).

<br><br>

**This notebook** explores a dataset containing reviews of different bottles of Scotch whisky (simply whisky from now on). Specifically, we will:

- gain insights on their price, age, alcohol percentage etc.,
- compare different categories, 
- study the reviews given to each category, and
- answer some interesting questions such as 'which is the most expensive bottle?' or 'which bottle is the best bargain?'.

## Dataset

The dataset includes information about more than 2200 whisky bottles from Scotland and was uploaded by [thatdataanalyst](https://www.kaggle.com/koki25ando) on Kaggle. The data was scraped from the website [Whisky Advocate](https://www.whiskyadvocate.com/).

## Loading and Processing Data

We start by importing the data into a Pandas DataFrame and looking at the top five rows using the `head()` method:

In [None]:
df = pd.read_csv('../input/22000-scotch-whisky-reviews/scotch_review.csv', index_col = 0)
df.index = df.index - 1   #remove 1 so that the index starts from 0

print('This dataset has {} rows and {} columns.'.format(df.shape[0], df.shape[1]))
df.head()

The meaning of each column is the following:

- **name**: Name of the bottle,
- **category**: Its category,
- **review.point**: Points marked by each reviewer,
- **price**: Price of the bottle,
- **currency**: Unit of price, and
- **description**: Descriptions of reviews.

We can rename the 'review.point' attribute to just 'points':

In [None]:
df.rename(columns = {'review.point': 'points'}, inplace = True)
df.columns

### Missing Values and Type Conversions

The `info()` method can give us valuable information such as the number of missing values and the type of each attribute:

In [None]:
df.info()

Thankfully there are no missing values in our dataset. 

There are five categorical attributes and one numeric. The 'price' column should be numeric, so we would like to convert it with the `astype()` method. However, a direct conversion raises an error because some values contain symbols such as commas, '/set', etc. We can easily confirm this:

In [None]:
symbol_idx = pd.to_numeric(df['price'], errors = 'coerce').isnull() # errors = 'coerce' results in NaNs for non-numeric values
df[symbol_idx][['name','price']].head()
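To see why `errors = 'coerce'` is handy here, the following minimal sketch applies the same idea to a handful of made-up price strings (the values are illustrative, not rows from the dataset). Anything that cannot be parsed becomes NaN, so `isnull()` flags exactly the entries that need cleaning.

```python
import pandas as pd

# Made-up price strings for illustration (not rows from the dataset)
raw_prices = pd.Series(['150', '60,000/set', '44/liter', '99.5'])

# Unparseable entries are coerced to NaN instead of raising an error
parsed = pd.to_numeric(raw_prices, errors='coerce')
needs_cleaning = parsed.isnull()

# Show only the strings that could not be converted
print(raw_prices[needs_cleaning].tolist())
```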

I could write a function to automatically change these values, but since there aren't that many I have decided to change them manually:

In [None]:
df.loc[[19, 95, 410, 1000, 1215], 'price'] = 15000   # instances with '60,000/set' which equals 15000 dollars (.loc accepts a list of labels, .at only a single label)
df['price'] = df['price'].replace('/liter', '', regex = True) # this bottle was actually 1 lt, so we don't need the price per litre
df['price'] = df['price'].replace(',', '', regex = True)

df['price'] = df['price'].astype('float')
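For a larger dataset, the manual fixes could be replaced by a small helper. The sketch below is one possible approach, not the method used in this notebook: `clean_price` and its `bottles_per_set` parameter are hypothetical names, and the default of 4 bottles per set is an assumption inferred from the '60,000/set' to 15000 dollars conversion above.

```python
import re

def clean_price(raw, bottles_per_set=4):
    """Convert a raw price string to a float (hypothetical helper).

    Assumes '/set' prices cover `bottles_per_set` bottles and that
    '/liter' prices refer to a one-litre bottle, so no scaling is needed.
    """
    text = str(raw).replace(',', '')           # drop thousands separators
    match = re.search(r'\d+(?:\.\d+)?', text)  # first number in the string
    if match is None:
        return float('nan')
    value = float(match.group())
    if '/set' in text:
        value /= bottles_per_set               # price per bottle
    return value

# Made-up examples covering the formats mentioned above
samples = ['150', '60,000/set', '44/liter', '1,500.00']
cleaned = [clean_price(s) for s in samples]
print(cleaned)
```

With a DataFrame, such a helper could be applied as `df['price'].map(clean_price)`.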

### Currency

We can check whether there is more than one currency in our dataset:

In [None]:
df['currency'].value_counts()

There is only one, therefore we can delete it since it doesn't give any extra information:

In [None]:
df.drop('currency', axis = 1, inplace = True)

### New Attributes

#### Price per Points

We can calculate the price to points ratio:

In [None]:
df['price_p_points'] = df['price']/df['points']
df.head()

#### Age and Alcohol Percentage

Finally, the 'name' column sometimes contains useful information such as the age and the alcohol percentage of each bottle. We can extract those using regular expressions:

In [None]:
df['age'] = df['name'].str.extract(r'(\d+) year')[0].astype(float) # extract age and convert to float

df['name'] = df['name'].str.replace(' ABV ', '')
df['alcohol%'] = df['name'].str.extract(r"([\(\,\,\'\"\’\”\$] ? ?\d+(\.\d+)?%)")[0]
df['alcohol%'] = df['alcohol%'].str.replace(r"[^\d\.]", "", regex = True).astype(float) # keep only numerics and convert to float (regex = True is required in pandas >= 2.0)

df[['name', 'age', 'alcohol%']].sample(10, random_state = 42)

The extraction looks accurate for this sample. Some bottles do not state an age or an alcohol percentage in their name, hence the NaN values. Let's check the number of missing values for our new attributes:

In [None]:
df[['age', 'alcohol%']].isnull().sum()
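The extraction logic can be easier to follow on a few individual strings. The sketch below uses Python's `re` module on made-up bottle names (not rows from the dataset). The age pattern is the one used above, while `ABV_PATTERN` is a simplified version of the notebook's alcohol pattern, and `extract_age_and_abv` is a hypothetical helper name.

```python
import re

# Made-up bottle names for illustration (not rows from the dataset)
names = [
    'Example Distillery 16 year old, 43%',
    'Example Cask Strength, 57.1%',
    'Example Reserve 12 year old',
]

AGE_PATTERN = re.compile(r'(\d+) year')
# Simplified pattern: a leading '(', ',' or '$', up to two optional spaces,
# then the number (captured) followed by '%'
ABV_PATTERN = re.compile(r'[(,$] ? ?(\d+(?:\.\d+)?)%')

def extract_age_and_abv(name):
    """Return (age, abv) as floats, with None where the name has no match."""
    age = AGE_PATTERN.search(name)
    abv = ABV_PATTERN.search(name)
    return (float(age.group(1)) if age else None,
            float(abv.group(1)) if abv else None)

results = [extract_age_and_abv(name) for name in names]
for name, (age, abv) in zip(names, results):
    print(f'{name!r}: age={age}, abv={abv}')
```

Names without a match yield `None` here, which corresponds to the NaN values produced by `str.extract()`.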

<br>

## Exploratory Data Analysis

We can use the `describe()` method to get a (statistical) overview of the numerical attributes:

In [None]:
df.describe()

We will discuss the extremes (max, min) of each attribute later. Apart from that, the most important points to note are:

- For the 'price' attribute, the mean is more than 550 dollars, while its median is only 110, which indicates a strongly right-skewed distribution,
- The age starts from 3 years, which is the minimum maturation time before a spirit can be called (Scotch) whisky ([Source](https://scotchwhisky.com/magazine/ask-the-professor/8033/whisky-and-maturity-why-three-years/)), and
- Alcohol content starts from 40%, which is the minimum percentage for a (Scotch) whisky according to Scottish regulations ([Source](https://flaviar.com/blog/alcohol-content-in-whisky)).

The last two points are consistent with the regulations and suggest that we successfully extracted the age and alcohol content. 

We can quickly visualise this information by plotting a histogram for each numerical attribute:

In [None]:
attributes = ['price', 'points', 'age', 'alcohol%']  # price_p_points not that important

df[attributes].hist(figsize = (15, 10), color = 'firebrick');

---

### Whisky Categories



In [None]:
df['category'].value_counts()

There are only five categories of whiskies in our dataset. Let's use a bar plot to visualise how many instances belong to each category:

In [None]:
colors = ['#BB342F', '#EDAFB8', '#666A86', '#95B8D1', '#D1D0A3']
categories_index = df['category'].value_counts().index

fig = plt.figure(figsize = (7, 4))
sns.countplot(y = 'category', data = df, palette = colors, order = df['category'].value_counts().index)

# include the percentage of each category next to each bar
for index, value in enumerate(df['category'].value_counts()):
    label =  '{}%'.format(round( (value/df['category'].shape[0])*100, 2)) 
    plt.annotate(label, xy = (value + 11, index + 0.1), color = colors[index])

plt.title('Number of bottles by category (with percentages)')
plt.ylabel('Category')
plt.xlabel('Count');

- More than 80% of all whiskies in our dataset are Single Malts,
- Blended whisky and Blended Malt whisky come second and third, corresponding to approximately 9% and 6% of the dataset, respectively, and
- Single Grain and Grain whisky combined correspond to less than 4% of the dataset.
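The percentages quoted above can also be computed directly, without looping over the counts. A minimal sketch, using a made-up category column rather than the real data, relies on the `normalize` argument of `value_counts()`:

```python
import pandas as pd

# Made-up category column for illustration (not the real distribution)
category = pd.Series(['Single Malt'] * 8 + ['Blended'] * 2)

# normalize=True returns fractions instead of raw counts
share = (category.value_counts(normalize=True) * 100).round(2)
print(share)
```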

Now, let's compare their (mean) price and points:

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (13, 5))

ax1.set_title('Mean price by category')
sns.barplot(x = 'price', y = 'category', data = df, order = categories_index, palette = colors, ax = ax1)

ax2.set_title('Mean points by category')
sns.barplot(x = 'points', y = 'category', data = df, order = categories_index, palette = colors, ax = ax2)
ax2.set(yticklabels = [])
ax2.set_ylabel('')
ax2.set_xlim(80, 89);

- The error bar for Blended whiskies is quite big, so we should probably not draw any conclusions using the mean (we could try the median of each category - see boxplot).
- On average, Blended Malt whiskies score highest, with Blended whiskies following closely.
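The sensitivity of the mean to a few very expensive bottles can be illustrated with a toy example. The prices below are invented purely for illustration; the point is that a single extreme value inflates the mean of a category while leaving its median almost unchanged.

```python
import pandas as pd

# Invented prices: one extreme bottle in the 'Blended' group
toy = pd.DataFrame({
    'category': ['Blended'] * 4 + ['Single Malt'] * 4,
    'price': [20, 30, 40, 20000, 60, 80, 100, 120],
})

# Compare mean and median per category
summary = toy.groupby('category')['price'].agg(['mean', 'median'])
print(summary)
```

In this toy data the 'Blended' mean is far above its median, whereas the two statistics coincide for 'Single Malt'.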

We can use Seaborn's `boxplot()` to visualize the distribution of values for 'price' and 'points' (outliers are hidden for 'price' via `showfliers = False` so that the boxes remain readable):

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 6))

sns.boxplot(x = 'price', y = 'category', data = df, order = categories_index, palette = colors, showfliers = False, ax = ax1)
sns.boxplot(x = 'points', y = 'category', data = df, order = categories_index, palette = colors, ax = ax2)
ax2.set(yticklabels = [])
ax2.set_ylabel('')

plt.tight_layout();

We observe that:

- Blended and Blended Malt whiskies have a similar median price, which is approximately 50 dollars lower than the rest of the categories,
- The median points have a small range of values: from 86 (Single Grain) to 88 (Blended Malts),
- The boxplot for 'price' has a positive skew (the median sits closer to the lower edge of the box) for all categories, and
- The distribution of 'points' more closely resembles a normal distribution.
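The link between the position of the median inside the box and the skew can be checked numerically. The sketch below uses invented prices and the standard library's `statistics.quantiles` (which may use a slightly different quartile convention than the one Seaborn uses for its boxes). When the median is closer to the first quartile than to the third, the distribution is positively skewed.

```python
import statistics

# Invented, right-skewed prices for illustration
prices = [20, 40, 60, 80, 110, 150, 300, 900, 5000]

# Quartiles: the edges of the box (q1, q3) and the median (q2)
q1, q2, q3 = statistics.quantiles(prices, n=4)

lower_half_width = q2 - q1   # distance from median to lower box edge
upper_half_width = q3 - q2   # distance from median to upper box edge

print(f'Q1={q1}, median={q2}, Q3={q3}')
print('Positive skew:', lower_half_width < upper_half_width)
```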

We can perform a similar analysis for the 'age' and 'alcohol%':

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (13, 5))

ax1.set_title('Mean age by category')
sns.barplot(x = 'age', y = 'category', data = df, order = categories_index, palette = colors, ax = ax1)

ax2.set_title('Mean alcohol% by category')
sns.barplot(x = 'alcohol%', y = 'category', data = df, order = categories_index, palette = colors, ax = ax2)
ax2.set(yticklabels = [])
ax2.set_ylabel('')
ax2.set_xlim(30, 55);

On average, Single Grain and Grain whiskies are the oldest and most alcoholic bottles in our dataset compared to the other three categories.

---

### Description Attribute

We can analyse the 'description' column by using word clouds.

For this purpose, we need a package called `wordcloud` (repository name `word_cloud`), which was developed by **Andreas Mueller** ([link](https://github.com/amueller/word_cloud/)). For simplicity, we are going to analyse only the two most common categories, Single Malts and Blended Whiskies.

In [None]:
from wordcloud import WordCloud, STOPWORDS

def create_word_cloud(df, bg_color, max_words, mask, stop_words, max_font_size, colormap):
    
    wc = WordCloud(background_color = bg_color, max_words = max_words, mask = mask, stopwords = stop_words, max_font_size = max_font_size)
    wc.generate(' '.join(df))
    
    return wc.recolor(colormap = colormap, random_state = 42)

In [None]:
sm_df = df[df['category'] == 'Single Malt Scotch']['description'].values
bd_df = df[df['category'] == 'Blended Scotch Whisky']['description'].values

The package also allows us to superimpose the cloud onto a mask of any shape. In our case, the masks are going to be the flag of Scotland and a whisky bottle.

Let's import the two png images:

In [None]:
from PIL import Image
import requests # https://stackoverflow.com/questions/12020657/how-do-i-open-an-image-from-the-internet-in-pil/12020860

mask_sm = np.array(Image.open(requests.get('https://imgur.com/dAYKIVT.png', stream = True).raw))  # mask for single malts
mask_bd = np.array(Image.open(requests.get('https://imgur.com/upL1TBW.png', stream = True).raw))  # mask for blended whiskies

In [None]:
stop_words = list(STOPWORDS)

fig, ax = plt.subplots(2, 1, figsize = (7, 14))

ax[0].imshow(create_word_cloud(df = sm_df, bg_color = 'white', max_words = 50, mask = mask_sm, stop_words = stop_words,
                             max_font_size = 50, colormap = 'winter'), alpha = 1, interpolation = 'bilinear')
ax[0].set_title('Single Malts - Initial', size = 16, y = 1.06)
ax[0].axis('off')

ax[1].imshow(create_word_cloud(df = bd_df, bg_color = 'white', max_words = 50, mask = mask_bd, stop_words = stop_words,
                             max_font_size = 50, colormap = 'tab10'), alpha = 1, interpolation = 'bilinear')
ax[1].set_title('Blended Whiskies - Initial', size = 16, y = 1.04)
ax[1].axis('off');

However, words such as whisky, palate, note etc. aren't really informative. So let's add them to our stopwords and re-generate the clouds:

In [None]:
stop_words = ['whisky', 'whiskies', 'blend', 'note', 'notes', 'year', 'years', 'old', 'nose', 'finish', 'bottle',
              'bottles', 'bottled', 'along', 'release', 'flavor', 'cask', 'well', 'make', 'mouth', 'palate', 'hint',
              'one', 'bottling', 'distillery', 'quite', 'time', 'date', 'show', 'first'] + list(STOPWORDS)

In [None]:
fig, ax = plt.subplots(2, 1, figsize = (7, 14))

ax[0].imshow(create_word_cloud(df = sm_df, bg_color = 'white', max_words = 50, mask = mask_sm, stop_words = stop_words,
                             max_font_size = 50, colormap = 'winter'), alpha = 1, interpolation = 'bilinear')
ax[0].set_title('Single Malts - Improved', size = 16, y = 1.06)
ax[0].axis('off')

ax[1].imshow(create_word_cloud(df = bd_df, bg_color = 'white', max_words = 50, mask = mask_bd, stop_words = stop_words,
                             max_font_size = 50, colormap = 'tab10'), alpha = 1, interpolation = 'bilinear')
ax[1].set_title('Blended Whiskies - Improved', size = 16, y = 1.04)
ax[1].axis('off');

Therefore, the most common tasting notes for Single Malts are ‘vanilla’, ‘fruit’, ‘spice’, ‘ginger’ etc., while for Blended whiskies they are ‘spice’, ‘vanilla’, ‘toffee’, ‘fruit’, ‘honey’, ‘orange’ etc.
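A word cloud sizes words by how often they appear, so the same ranking can be approximated with a plain frequency count. The sketch below is a simplified stand-in for what the `wordcloud` package does internally (it ignores details such as bigrams and plural handling), and the descriptions and stop words are made up for illustration.

```python
import re
from collections import Counter

# Made-up review snippets for illustration (not taken from the dataset)
descriptions = [
    'Vanilla and spice on the nose, with a hint of fruit.',
    'Rich toffee, vanilla, and orange peel.',
    'Spice, honey, and more vanilla on the finish.',
]

# A small, hypothetical stop-word set
stop_words = {'and', 'on', 'the', 'with', 'a', 'of', 'more',
              'nose', 'finish', 'hint'}

# Lowercase, split into words, drop stop words, and count
words = re.findall(r"[a-z']+", ' '.join(descriptions).lower())
counts = Counter(word for word in words if word not in stop_words)

print(counts.most_common(2))
```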

## Trivia

Now we can answer some interesting questions:

1) What's the most expensive whisky bottle?

In [None]:
df.sort_values(by = 'price', ascending = False).head()

Diamond Jubilee comes first with a price of 157,000 dollars. Interestingly, it is almost 100,000 dollars more expensive than the second whisky on this list, Dalmore 50 year old. Bowmore distillery has two bottles in the top five.

<img src="https://imgur.com/I1XgFOb.jpg" width="500">
<center>Photo taken from www.couturing.com</center>

2) What's the least expensive whisky bottle?

In [None]:
df.sort_values(by = 'price', ascending = True).head()

Monarch of the Glen is the cheapest option with a price of only 12 dollars. 

<img src="https://imgur.com/AZ1AxYI.png" width="300">
<center>Photo taken from www.sortitapps.com</center>

3) What's the best reviewed bottle? 

In [None]:
df.sort_values(by = 'points', ascending = False).head()

Three whiskies almost touch perfection, scoring 97 points! The most affordable of the three is Johnnie Walker Blue Label.

<img src="https://imgur.com/W1KANyR.png" width="300">
<center>Photo taken from www.thewhiskyexchange.com</center>

4) Which whisky gives the best value for money?

In [None]:
df.sort_values(by = 'price_p_points', ascending = True).head()

Carlyle 40% is a bargain since you can enjoy a total of 88 points for only 13 dollars.

5) What's the best bottle for under 50 dollars?

In [None]:
df[(df['points'] > 85) & (df['price'] < 50)].sort_values(by = 'points', ascending = False).head()

<img src="https://imgur.com/SlOK0IE.png" width="400">
<center>Photo taken from www.thewhiskyworld.com</center>
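When several bottles share the same score, sorting by points alone leaves their relative order arbitrary. One option, sketched below on invented data, is to break ties by price so that the cheaper bottle comes first. The bottle names, prices and points are hypothetical.

```python
import pandas as pd

# Invented bottles for illustration
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'price': [45, 30, 60, 20],
    'points': [90, 90, 95, 84],
})

# Same filter as above, then sort by points (descending)
# and use price (ascending) as a tie-breaker
under_50 = toy[(toy['points'] > 85) & (toy['price'] < 50)]
ranked = under_50.sort_values(by=['points', 'price'], ascending=[False, True])

print(ranked)
```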

6) What's the oldest bottle?

In [None]:
df.sort_values(by = 'age', ascending = False).head()

<img src="https://imgur.com/bavhilf.png" width="300">
<center>Photo taken from www.thewhiskyexchange.com</center>

The Gordon & MacPhail Generations: Glenlivet is the oldest bottle with an age of 70, and exceeds the second oldest bottle by 10 years.

7) What's the most alcoholic bottle?

In [None]:
df.sort_values(by = 'alcohol%', ascending = False).head()

<img src="https://imgur.com/9PxAcg3.png" width="300">
<center>Photo taken from www.whiskyauctioneer.com</center>

Adelphi 7 year old contains more than 66% alcohol! Interestingly, every other bottle in the top five contains at least 64% alcohol.

## Conclusions

In this notebook, we used a dataset containing reviews for more than 2K whisky bottles to discover useful and interesting facts related to the world of Scotch whisky. The most important things to remember are:

- There are five categories of whisky in our dataset: Single Malts, Blended whiskies, Blended Malts, Single Grain whiskies and Grain whiskies. The most common category is Single Malts,
- Blended Malt whiskies receive a higher score on average,
- Single Grain and Grain whiskies have on average a higher age and alcohol content,
- Word clouds can help us visualize (and summarise) the descriptions given for each whisky category, and
- The `sort_values()` method allows us to look at the ‘extremes’, and answer questions such as ‘what's the most expensive bottle?’, ‘what's the most alcoholic bottle?’ etc.
