# Wine Tasting Reviews

This dataset is a compilation of wine reviews covering many wines and wineries. It was compiled by the user "zackthoutt" on Kaggle.com. We will use his dataset to answer the following questions in this exploratory analysis:

1. Which country produces the best wine? (As measured by average points)
2. Which tasters give higher scores? Or lower ones?
3. Do comments or descriptions affect the score of the wine? Does the length of a comment affect its score?
4. Which region of Sicily & Sardinia produces the best wine?
5. Which wine is the most expensive?

Answering these questions should give us a better understanding of wine quality across countries, tasters, and regions, and of which wines score best.

In [1]:
import pandas as pd
df = pd.read_csv('/content/wine.csv')

In [2]:
df.head()

,country,description,designation,points,price,province,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   taster_name            103727 non-null  object 
 7   taster_twitter_handle  98758 non-null   object 
 8   title                  129971 non-null  object 
 9   variety                129970 non-null  object 
 10  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 10.9+ MB


In [4]:
df.describe()

,points,price
count,129971.0,120975.0
mean,88.447138,35.363389
std,3.03973,41.022218
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,3300.0


The preview and summaries above show what the data contains. Each row is one review, with the wine's country, province, variety, winery, price, and score (points). Points range from 80 to 100 and prices from 4 to 3,300, and several columns (for example price and taster_name) have missing values. Let's begin with our first question: which country produces the highest-rated wine?

In [5]:
# Which country produces the highest rated wine?

df.groupby('country')['points'].mean().sort_values(ascending = False).head(5)

country,points
England,91.581081
India,90.222222
Austria,90.101345
Germany,89.851732
Canada,89.36965


The code above groups the reviews by country, averages the points within each group, and sorts the averages in descending order. Based on the results, England has the highest-rated wine on average (about 91.6 points), followed by India and Austria. But what about the people who rated these wines? What can we deduce from their ratings?
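A mean on its own can be misleading when a country has only a handful of reviews. The sketch below shows one way to report the review count next to the mean and to filter on it. The toy DataFrame and the `min_reviews` cutoff are assumptions made for illustration; they are not values from the dataset.

```python
import pandas as pd

# Toy stand-in for the wine DataFrame (illustrative values, not from the dataset).
toy = pd.DataFrame({
    "country": ["England", "England", "Italy", "Italy", "Italy", "Italy"],
    "points": [92, 91, 88, 90, 87, 89],
})

# Aggregate the mean and the review count together so small samples are visible.
summary = (
    toy.groupby("country")["points"]
    .agg(mean_points="mean", n_reviews="count")
    .sort_values("mean_points", ascending=False)
)

# Hypothetical cutoff: keep only countries with at least this many reviews.
min_reviews = 3
reliable = summary[summary["n_reviews"] >= min_reviews]

print(summary)
print(reliable)
```

On the real data, the same pattern would be applied to `df` instead of `toy`, with a cutoff chosen to suit the analysis.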

In [27]:
# Which taster gives the lowest scores (points), on average?

df.groupby('taster_name')['points'].mean().sort_values(ascending = True)

taster_name,points
Alexander Peartree,85.855422
Carrie Dykes,86.395683
Susan Kostrzewa,86.609217
Fiona Adams,86.888889
Michael Schachner,86.907493
Lauren Buzzeo,87.73951
Christina Pickard,87.833333
Jeff Jenssen,88.319756
Anna Lee C. Iijima,88.415629
Joe Czerwinski,88.536235


According to this output, Alexander Peartree appears to be the most critical taster: his average score (about 85.9 points) is the lowest of all the tasters. Let's continue analyzing the wine.

In [28]:
# Which variety of wine is the most expensive, on average?

df.groupby('variety')['price'].mean().sort_values(ascending = False)

variety,price
Ramisco,495.000000
Terrantez,236.000000
Francisa,160.000000
Rosenmuskateller,150.000000
Malbec-Cabernet,113.333333
...,...
Roscetto,
Sauvignon Blanc-Sauvignon Gris,
Tempranillo-Malbec,
Vital,


Regarding the most expensive varieties, two in particular stand out. Ramisco and Terrantez have much higher average prices than all the other varieties, and they appear to be outliers. The varieties at the bottom of the output have no average price because none of their reviews list one. As the per-variety average points below show, Ramisco is not among the highest-rated varieties, whereas Terrantez shares the top average rating of 95 points, so it seems to offer better value.

In [29]:
df.groupby('variety')['points'].mean().sort_values(ascending = False)

variety,points
Terrantez,95.000000
Tinta del Pais,95.000000
Gelber Traminer,95.000000
Bual,94.142857
Sercial,94.000000
...,...
Shiraz-Tempranillo,82.000000
Aidani,82.000000
Picapoll,82.000000
Airen,81.666667
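
The price and points rankings above are computed separately, which makes them awkward to compare. A minimal sketch of putting both in one table follows. The toy data, the variety names, and the `points_per_price` column are hypothetical and serve only to show the pattern.

```python
import pandas as pd

# Toy stand-in for the wine DataFrame; names and numbers are illustrative only.
toy = pd.DataFrame({
    "variety": ["Variety A", "Variety B", "Variety C", "Variety C", "Variety D"],
    "points": [89, 95, 90, 88, 86],
    "price": [495.0, 236.0, 60.0, 40.0, None],
})

by_variety = (
    toy.groupby("variety")
    .agg(
        mean_price=("price", "mean"),
        mean_points=("points", "mean"),
        n_reviews=("points", "count"),
    )
    .dropna(subset=["mean_price"])  # drop varieties with no listed price
    .sort_values("mean_price", ascending=False)
)

# Hypothetical value measure: average points per unit of price.
by_variety["points_per_price"] = by_variety["mean_points"] / by_variety["mean_price"]

print(by_variety)
```

Keeping `n_reviews` in the table makes it easy to see when an average rests on a single review.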


In [8]:
# Which year of wines has the best score (points), on average?

# Take the first four-digit number in the title as the year (stored as a string)
df['year'] = df['title'].str.extract(r'(\d{4})')

df.groupby('year')['points'].mean().sort_values(ascending = False).head(5)

year,points
1969,98.0
1973,96.0
1952,95.5
1927,95.0
1945,95.0


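Here we see the five years with the highest average points. Wines from 1969 have the highest average rating (98.0), so the best-scoring year is neither the oldest nor the newest in the dataset. Keep in mind that the year is simply the first four-digit number found in the title, so it may not always be the vintage.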
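
Because the year comes from a plain four-digit match, a title containing another number (a founding year in a winery name, for instance) could be misread. The sketch below contrasts the pattern used above with a stricter variant. The 1900 to 2099 range is an assumption, and the last two titles are made up for illustration.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",                    # from df.head()
    "Rainstorm 2013 Pinot Gris (Willamette Valley)",       # from df.head()
    "Hypothetical Cellars 1882 Reserve 2011 Red (Douro)",  # made up for illustration
    "Hypothetical Brut (Champagne)",                       # made up, no year at all
])

# Pattern used in the analysis: the first run of four digits anywhere in the title.
first_four_digits = titles.str.extract(r"(\d{4})", expand=False)

# Stricter variant (assumption): a standalone number between 1900 and 2099,
# converted to a numeric type so years sort and compare as numbers.
vintage = pd.to_numeric(
    titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False),
    errors="coerce",
)

print(pd.DataFrame({"first_four_digits": first_four_digits, "vintage": vintage}))
```

Titles without a matching year end up as missing values, which `groupby` skips by default.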

# Descriptions

Now we turn to the descriptions written by the wine tasters. Here we will check whether the following things are associated with the rating:

1. the word "depth" being used,
2. the word "fruity" being used,
3. the word "herbal" being used,
4. the length of the description.

In [9]:
# Do reviews with the word "depth" in them tend to get better than average or worse than average points?

df['depth'] = df['description'].str.contains('depth')

df.groupby('depth')['points'].mean().sort_values(ascending = False).head(10)

depth,points
True,90.109412
False,88.413872


In [10]:
# Do reviews with the word "fruity" in them tend to get better than average or worse than average points?

df['fruity'] = df['description'].str.contains('fruity')

df.groupby('fruity')['points'].mean().sort_values(ascending = False).head(10)

fruity,points
False,88.509749
True,87.614521


In [11]:
# Do reviews with the word "herbal" in them tend to get better than average or worse than average points?

df['herbal'] = df['description'].str.contains('herbal')

df.groupby('herbal')['points'].mean().sort_values(ascending = False).head(10)

herbal,points
False,88.48925
True,87.470019


In [36]:
# What is the relationship between number of characters (description) and points?

df['description_length'] = df['description'].str.len()

correlation = df[['description_length','points']].corr()

print(correlation)

                    description_length   points
description_length             1.00000  0.55776
points                         0.55776  1.00000


After running those lines of code, we can now answer the previous questions:

1. Do reviews with the word "depth" in them tend to get better than average or worse than average points? Better: they average 90.11 points, versus 88.41 for reviews without the word.
2. Do reviews with the word "fruity" in them tend to get better than average or worse than average points? Worse: they average 87.61 points, versus 88.51 for reviews without the word.
3. Do reviews with the word "herbal" in them tend to get better than average or worse than average points? Worse: they average 87.47 points, versus 88.49 for reviews without the word.
4. What is the relationship between number of characters (description) and points? There is a moderate positive correlation (about 0.56), so longer descriptions tend to go with higher scores.
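
The three keyword checks repeat the same steps, so they can be folded into one helper. The sketch below is one possible version, not the code used above: `keyword_gap` is a hypothetical name, the toy reviews are invented, and the matching differs from the cells above in that it is case-insensitive and matches whole words only.

```python
import re
import pandas as pd

# Toy reviews, invented for illustration.
toy = pd.DataFrame({
    "description": [
        "Great depth and a long finish with layered spice.",
        "Simple and fruity.",
        "Herbal notes, thin body.",
        "Fruity core with real depth and impressive length.",
    ],
    "points": [93, 85, 84, 92],
})

def keyword_gap(frame, word):
    """Mean points of reviews containing `word` minus mean points of the rest."""
    pattern = rf"\b{re.escape(word)}\b"  # whole-word match (assumption)
    has_word = frame["description"].str.contains(pattern, case=False, regex=True)
    return frame.loc[has_word, "points"].mean() - frame.loc[~has_word, "points"].mean()

gaps = {word: keyword_gap(toy, word) for word in ["depth", "fruity", "herbal"]}

# Pearson correlation between description length and points.
length_corr = toy["description"].str.len().corr(toy["points"])

print(gaps)
print(length_corr)
```

A positive gap means reviews that use the word score higher on average. Like the correlation, it describes an association and does not show that the wording causes the score.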

In [25]:
# Which region in the province of Sicily & Sardinia produces the best wine, on average?

# Take the text inside the parentheses in the title as the region
df['region'] = df['title'].str.extract(r'\((.+)\)')

df_ss = df[df['province'] == 'Sicily & Sardinia']

df_ss.groupby('region')['points'].mean().sort_values(ascending = False).head()

region,points
Faro,94.0
Moscato di Noto,92.0
Contessa Entellina,91.857143
Alghero,91.5
Passito di Pantelleria,91.363636
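
The region pattern above is greedy, so a title with more than one pair of parentheses would capture everything between the first "(" and the last ")". The sketch below shows a variant that keeps only the final parenthesized group and reports the review count with each regional mean. Two of the three titles and all of the point values other than the first are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",         # from df.head()
        "Hypothetical (Estate) 2012 Rosso (Faro)",  # made up for illustration
        "Hypothetical 2014 Bianco (Etna)",          # made up for illustration
    ],
    "points": [87, 94, 89],
})

# Pattern used in the analysis: greedy match between the first "(" and the last ")".
greedy = toy["title"].str.extract(r"\((.+)\)", expand=False)

# Variant (assumption): only the last parenthesized group at the end of the title.
toy["region"] = toy["title"].str.extract(r"\(([^()]+)\)\s*$", expand=False)

ranking = (
    toy.groupby("region")["points"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

print(greedy.tolist())
print(ranking)
```

The `count` column helps show whether a top-ranked region is backed by many reviews or just one.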


# Closing Remarks

And finally, which region in Sicily & Sardinia produces the best wine? After looking through the descriptions, prices, varieties, tasters, and points, we can see that Faro is the region with the highest average score (94.0 points).