# Wine Review Analysis
⌨ Jeran Burget

Wine tasting is the art of evaluating wines through sight, smell, and taste. It involves assessing color, identifying aromas, and analyzing flavors, acidity, and texture. Subjective yet informative, it helps judge quality, age potential, and personal preferences, making it a nuanced exploration of wine's intricate characteristics.

## Analysis
I will perform an exploratory data analysis on a dataset of wine reviews to determine which characteristics make a wine great and which do not. The first few cells load the dataset and describe its contents.

In [1]:
import pandas as pd
df = pd.read_csv('wine.csv')

In [2]:
df.head()

,country,description,designation,points,price,province,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129971 entries, 0 to 129970
Data columns (total 11 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   country                129908 non-null  object 
 1   description            129971 non-null  object 
 2   designation            92506 non-null   object 
 3   points                 129971 non-null  int64  
 4   price                  120975 non-null  float64
 5   province               129908 non-null  object 
 6   taster_name            103727 non-null  object 
 7   taster_twitter_handle  98758 non-null   object 
 8   title                  129971 non-null  object 
 9   variety                129970 non-null  object 
 10  winery                 129971 non-null  object 
dtypes: float64(1), int64(1), object(9)
memory usage: 10.9+ MB


In [4]:
df.describe()

,points,price
count,129971.0,120975.0
mean,88.447138,35.363389
std,3.03973,41.022218
min,80.0,4.0
25%,86.0,17.0
50%,88.0,25.0
75%,91.0,42.0
max,100.0,3300.0


### Which country produces wine with the most points, on average?

In [22]:
round(df[['country', 'points']].groupby('country').mean().sort_values(by='points', ascending=False), 2)

country,points
England,91.58
India,90.22
Austria,90.1
Germany,89.85
Canada,89.37
Hungary,89.19
China,89.0
France,88.85
Luxembourg,88.67
Australia,88.58


The data shows that England has the highest average score, at 91.58 points.
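
An average on its own does not show how many reviews sit behind it, so a country with only a handful of reviews can top the ranking. The following is a minimal sketch on a small made-up DataFrame (the rows and the `min_reviews` threshold are illustrative assumptions, not values from `wine.csv`) that reports the review count next to the mean:

```python
import pandas as pd

# Hypothetical toy data standing in for wine.csv; the values are made up.
toy = pd.DataFrame({
    'country': ['England', 'France', 'France', 'France', 'US', 'US'],
    'points':  [92, 88, 90, 89, 87, 85],
})

# Mean score and number of reviews per country, highest mean first.
by_country = (
    toy.groupby('country')['points']
       .agg(mean_points='mean', n_reviews='count')
       .round(2)
       .sort_values(by='mean_points', ascending=False)
)

# Keep only countries with at least `min_reviews` reviews (arbitrary threshold).
min_reviews = 2
well_sampled = by_country[by_country['n_reviews'] >= min_reviews]

print(by_country)
print(well_sampled)
```

Running the same aggregation on `df` would show whether the top-ranked countries are backed by many reviews or only a few.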

### Which taster gives the lowest scores (points), on average?

In [23]:
round(df[['taster_name', 'points']].groupby('taster_name').mean().sort_values(by='points'), 2)

taster_name,points
Alexander Peartree,85.86
Carrie Dykes,86.4
Susan Kostrzewa,86.61
Fiona Adams,86.89
Michael Schachner,86.91
Lauren Buzzeo,87.74
Christina Pickard,87.83
Jeff Jenssen,88.32
Anna Lee C. Iijima,88.42
Joe Czerwinski,88.54


It seems Alexander Peartree is the most critical, with the lowest average score of 85.86.
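
Note that `df.info()` reports 103,727 non-null `taster_name` values out of 129,971 rows, and `groupby` leaves out rows whose key is missing by default. The sketch below uses made-up data (the names and scores are hypothetical, and `dropna=False` assumes pandas 1.1 or later) to show how those unattributed reviews could be kept as their own group:

```python
import pandas as pd

# Hypothetical reviews; two of them have no taster name.
toy = pd.DataFrame({
    'taster_name': ['A. Taster', 'A. Taster', None, None, 'B. Taster'],
    'points':      [86, 88, 90, 92, 85],
})

# Default behaviour: rows with a missing taster_name are silently excluded.
default_means = toy.groupby('taster_name')['points'].mean()

# dropna=False keeps the missing key as a separate group (labelled NaN).
with_missing = toy.groupby('taster_name', dropna=False)['points'].mean()

print(default_means.sort_values())
print(with_missing.sort_values())
```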

### Which variety of wine is the most expensive, on average?

In [12]:
round(df[['variety', 'price']].groupby('variety').mean().sort_values(by='price', ascending=False), 2)

variety,price
Ramisco,495.00
Terrantez,236.00
Francisa,160.00
Rosenmuskateller,150.00
Malbec-Cabernet,113.33
...,...
Roscetto,
Sauvignon Blanc-Sauvignon Gris,
Tempranillo-Malbec,
Vital,


Ramisco comes out on top with an average price of $495 per bottle.
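
The tail of the table lists varieties with no average price at all. That happens when every review of a variety is missing `price`: `mean` skips missing values, and a group with nothing left yields `NaN`. Here is a sketch on made-up rows (the variety names and prices are hypothetical) that drops those groups before ranking:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'White B' has no recorded price in any review.
toy = pd.DataFrame({
    'variety': ['Red A', 'Red A', 'White B', 'White B', 'Rose C'],
    'price':   [20.0, np.nan, np.nan, np.nan, 15.0],
})

# mean() ignores NaN within a group; an all-NaN group averages to NaN.
avg_price = toy.groupby('variety')['price'].mean()

# Remove varieties without any price, then rank from most to least expensive.
ranked = avg_price.dropna().sort_values(ascending=False).round(2)

print(avg_price)
print(ranked)
```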

### Which year of wines has the best score (points), on average?

First, we need to extract the year from the wine's title using a regular expression. Then, we'll save it in a new column on the dataframe.

In [19]:
# Capture the first run of four digits in each title (raw string avoids invalid-escape warnings).
df['year'] = df['title'].str.extract(r'(\d{4})')
df['year']


0         2013
1         2011
2         2013
3         2013
4         2012
          ... 
129966    2013
129967    2004
129968    2013
129969    2012
129970    2012
Name: year, Length: 129971, dtype: object

Now that the new column is added, we can do the same calculation as before.

In [24]:
round(df[['year', 'points']].groupby('year').mean().sort_values(by='points', ascending=False), 2)

year,points
1969,98.00
1973,96.00
1952,95.50
1927,95.00
1945,95.00
...,...
1856,85.33
1882,85.25
1150,84.50
1492,83.33


According to this calculation, 1969 is the top year, with an average score of 98.
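
The pattern used here captures the first run of four digits anywhere in the title, which is likely why values such as 1150 and 1492 show up as years, and the extracted values are strings rather than numbers. Below is a sketch of a stricter variant, assuming vintages fall between 1900 and 2099 (that range and the two "Example" titles are my assumptions, not part of the dataset):

```python
import pandas as pd

titles = pd.Series([
    'Rainstorm 2013 Pinot Gris (Willamette Valley)',    # title from the sample rows
    'Example Winery 1492 Red Blend 2011 (Somewhere)',   # hypothetical
    'Example Non-Vintage Brut (Somewhere)',             # hypothetical
])

# Match a standalone four-digit number starting with 19 or 20.
years = titles.str.extract(r'\b((?:19|20)\d{2})\b', expand=False)

# Convert to numbers; titles without a plausible vintage become NaN.
years = pd.to_numeric(years)

print(years)
```

With a numeric column, years can also be filtered or sorted chronologically before grouping.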

### Do reviews with the word "depth" in them tend to get better than average or worse than average points?

We'll do something similar to what we did with the year and create a new column that flags any wine with the word "depth" in its description.

In [28]:
df['depth'] = df['description'].str.contains('depth')
df['depth']

0         False
1         False
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Name: depth, Length: 129971, dtype: bool

Now that we have the new "depth" column, we can do our calculation.

In [30]:
round(df[['depth', 'points']].groupby('depth').mean().sort_values(by='points', ascending=False), 2)

depth,points
True,90.11
False,88.41


### Do reviews with the word "fruity" in them tend to get better than average or worse than average points?

We'll use the same process as with "depth", this time checking for the word "fruity".

In [31]:
df['fruity'] = df['description'].str.contains('fruity')
df['fruity']

0         False
1          True
2         False
3         False
4         False
          ...  
129966    False
129967    False
129968     True
129969    False
129970    False
Name: fruity, Length: 129971, dtype: bool

In [32]:
round(df[['fruity', 'points']].groupby('fruity').mean().sort_values(by='points', ascending=False), 2)

fruity,points
False,88.51
True,87.61


Interestingly, it seems that wines with the word "fruity" in their descriptions tend to score fewer points on average.

### Do reviews with the word "herbal" in them tend to get better than average or worse than average points?

You know the drill by now. This time we're checking for "herbal" in the wine's description.

In [33]:
df['herbal'] = df['description'].str.contains('herbal')
df['herbal']

0         False
1         False
2         False
3         False
4          True
          ...  
129966    False
129967    False
129968    False
129969    False
129970    False
Name: herbal, Length: 129971, dtype: bool

In [34]:
round(df[['herbal', 'points']].groupby('herbal').mean().sort_values(by='points', ascending=False), 2)

herbal,points
False,88.49
True,87.47


As with "fruity", it looks like reviews containing "herbal" also score fewer points on average.
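
One caveat about the three keyword checks: `str.contains` with a plain word is case-sensitive and matches substrings, so "Herbal" at the start of a sentence is missed while a longer word such as "herbals" would count. The sketch below shows a reusable helper (the function name `mean_points_by_word` and the sample rows are hypothetical) that matches whole words regardless of case and returns the average score with and without the word:

```python
import re
import pandas as pd

def mean_points_by_word(frame, word):
    """Average points for reviews that do (True) and do not (False) contain `word`."""
    # \b marks word boundaries; re.escape guards against regex metacharacters.
    pattern = rf'\b{re.escape(word)}\b'
    has_word = frame['description'].str.contains(pattern, case=False, regex=True, na=False)
    return frame.groupby(has_word)['points'].mean().round(2)

# Hypothetical reviews for illustration only.
toy = pd.DataFrame({
    'description': ['Herbal and lean.', 'Ripe, with great depth.',
                    'Soft and fruity.', 'Simple.'],
    'points': [86, 92, 87, 88],
})

herbal_means = mean_points_by_word(toy, 'herbal')
print(herbal_means)
```

Calling the helper on `df` for "depth", "fruity", and "herbal" would avoid adding a separate boolean column for each word.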

### What is the relationship between number of characters and points?

Next, we're going to see whether the number of characters in a wine review correlates with the number of points the reviewer gave the wine. First, we need to count the characters in each review and save the result in a new column.

In [36]:
df['length_of_review'] = df['description'].str.len()

Now, we can check the correlation.

In [37]:
df[['length_of_review', 'points']].corr()

,length_of_review,points
length_of_review,1.0,0.55776
points,0.55776,1.0


With a correlation coefficient of 0.55776, there is a moderate positive relationship between the length of a review and the number of points awarded.
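
`DataFrame.corr` computes the Pearson coefficient by default, which measures linear association only. Here is a sketch on made-up numbers (the values are hypothetical) that computes Pearson from its definition, checks it against pandas, and also computes the rank-based Spearman coefficient as Pearson on ranks:

```python
import pandas as pd

# Hypothetical review lengths and scores.
toy = pd.DataFrame({
    'length_of_review': [120, 180, 240, 300, 420],
    'points':           [84, 87, 88, 91, 92],
})

x = toy['length_of_review']
y = toy['points']

# Pearson r = cov(x, y) / (std(x) * std(y)), all using sample (n - 1) statistics.
pearson_manual = x.cov(y) / (x.std() * y.std())
pearson_pandas = x.corr(y)

# Spearman correlation is Pearson correlation applied to the ranks.
spearman = x.rank().corr(y.rank())

print(pearson_manual, pearson_pandas, spearman)
```

A Spearman value noticeably higher than the Pearson value would suggest the relationship is monotonic but not strictly linear.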

### Which region in the province of Sicily & Sardinia produces the best wine, on average?

To determine the region that the wine came from, we'll need to extract it from the title of the review with a regular expression.

In [39]:
# Capture the text between parentheses in each title (raw string avoids invalid-escape warnings).
df['region'] = df['title'].str.extract(r'\((.+)\)')
df['region']

0                                                      Etna
1                                                     Douro
2                                         Willamette Valley
3                                       Lake Michigan Shore
4                                         Willamette Valley
                                ...                        
129966    Erben Müller-Burggraef) 2013 Brauneberger Juff...
129967                                               Oregon
129968                                               Alsace
129969                                               Alsace
129970                                               Alsace
Name: region, Length: 129971, dtype: object
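
Row 129966 in this output shows a weakness of the pattern: `.+` is greedy, so when a title contains more than one pair of parentheses the match runs from the first `(` to the last `)`. The sketch below tries a stricter pattern, assuming the region is always the final parenthesized group in the title (the two "Example" titles are hypothetical):

```python
import pandas as pd

titles = pd.Series([
    'Rainstorm 2013 Pinot Gris (Willamette Valley)',                   # from the sample rows
    'Example Estate (Example Heirs) 2013 Riesling (Example Region)',   # hypothetical
    'Example Winery 2012 Red',                                         # hypothetical, no region
])

# Greedy pattern from the analysis: spans from the first '(' to the last ')'.
greedy = titles.str.extract(r'\((.+)\)', expand=False)

# Stricter pattern: a parenthesized group with no nested parentheses at the end of the title.
last_group = titles.str.extract(r'\(([^()]+)\)\s*$', expand=False)

print(greedy)
print(last_group)
```

Titles without a trailing parenthesized group yield `NaN`, which `groupby` then leaves out of the per-region averages.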

Next, let's create a new data frame that will just show wine reviews from the province of Sicily & Sardinia.

In [52]:
sicily_sardinia_df = df[df['province'] == 'Sicily & Sardinia']

From this new data frame, we can now calculate which region scored the highest points, on average.

In [54]:
round(sicily_sardinia_df[['region', 'points']].groupby('region').mean().sort_values(by='points', ascending=False), 2)

region,points
Faro,94.0
Moscato di Noto,92.0
Contessa Entellina,91.86
Alghero,91.5
Passito di Pantelleria,91.36
Nasco di Cagliari,91.0
Contea di Sclafani,90.33
Malvasia delle Lipari,90.25
Eloro,90.2
Moscato di Pantelleria,90.0


### Conclusion

I performed an exploratory data analysis on a dataset of wine reviews to determine which characteristics make a wine great and which ones do not.

- The data indicates that England produces wine with the highest average score of 91.58 points.
- Alexander Peartree is the most critical taster, with an average score of 85.86 points.
- Ramisco is the most expensive variety of wine, with an average price of $495 per bottle.
- Wines from the year 1969 received the highest average score of 98 points.
- Reviews containing the word "depth" tend to receive better than average points.
- Conversely, reviews containing the word "fruity" or "herbal" tend to receive lower than average points.
- There is a moderate positive relationship (correlation coefficient of 0.55776) between the length of a review and the number of points scored.
- Among regions in the province of Sicily & Sardinia, the region of Faro stands out with the highest average score of 94 points.

This analysis highlights several factors that are associated with higher or lower review scores in this dataset.