<a href="https://colab.research.google.com/github/JakobDuffin/BATC-Data/blob/main/m5_project_JD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 5 Project - Wine Reviews EDA - Jake Duffin

Wine tasting is a mix of objective traits and personal judgment. Reviewers evaluate wines based on appearance, aroma, flavor, and structure, but their written descriptions can also reveal patterns in how wines are scored. In this project, I explored a dataset of Wine Enthusiast reviews to see which characteristics tend to be associated with higher or lower point ratings.

In [None]:
import pandas as pd
df = pd.read_csv('wine.csv')

df.head()

Unnamed: 0,country,description,designation,points,price,province,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


## Countries and Reviewers

Using average points by country, I found that England has the highest average score among the listed options. This suggests that wines reviewed from this country tend to score slightly higher on average in this dataset.

Among reviewers, Alexander Peartree gives the lowest average score (85.86). This likely reflects individual scoring tendencies rather than wine quality alone, since some reviewers are simply stricter than others.

In [None]:
# Question 1

country_av = df.groupby('country')['points'].mean()

round(country_av.sort_values(ascending=False).head(1),2)


country,points
England,91.58


In [None]:
# Question 2

taster_av = df.groupby('taster_name')['points'].mean()

round(taster_av.sort_values().head(1),2)


taster_name,points
Alexander Peartree,85.86
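
Both rankings above are plain group means, so a country or reviewer with only a handful of reviews can land at the top or bottom by chance. The sketch below is one way to check this: it reports the review count next to each mean and optionally drops small groups. The helper name `mean_with_count`, the `min_count` parameter, and the toy rows are assumptions for illustration, not part of the original notebook or of `wine.csv`.

```python
import pandas as pd

def mean_with_count(frame, group_col, value_col="points", min_count=1):
    """Return the mean and review count per group, sorted by mean (descending)."""
    summary = (
        frame.groupby(group_col)[value_col]
        .agg(mean="mean", count="count")
        .round({"mean": 2})
    )
    # Keep only groups with at least `min_count` reviews.
    kept = summary[summary["count"] >= min_count]
    return kept.sort_values("mean", ascending=False)

# Made-up rows for illustration only; not taken from wine.csv.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B"],
    "points": [88, 90, 89, 95],
})

print(mean_with_count(toy, "country"))               # B ranks first on a single review
print(mean_with_count(toy, "country", min_count=3))  # only A has enough reviews
```

On the real data, the same call with `group_col="taster_name"` would show how many reviews sit behind each reviewer's average.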


## Variety, Price, and Vintage

Among the wine varieties listed, "Ramisco" has the highest average price. Price can be influenced by rarity, production scale, and reputation, so it does not necessarily mean higher quality.

After extracting the year from the title using a regular expression, I found that 1969 has the highest average points among the choices. This result depends on how many wines from each year appear in the dataset, and the pattern simply takes the first four-digit run in a title, which is not guaranteed to be a vintage.

In [None]:
# Question 3

price_av = df.groupby('variety')['price'].mean()

round(price_av.sort_values(ascending=False).head(1),2)


variety,price
Ramisco,495.0


In [None]:
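# Question 4

# Take the first four-digit run in each title as the vintage year (kept as a string).
df['year'] = df['title'].str.extract(r'(\d{4})', expand=False)

# Average points per extracted year.
year_av = df.groupby('year')['points'].mean()

round(year_av.sort_values(ascending=False).head(1),2)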

year,points
1969,98.0
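
The year extraction above uses `\d{4}`, which captures any four consecutive digits. As a hedged alternative, the sketch below restricts matches to standalone numbers starting with 19 or 20 and converts them to numeric values. The function name `extract_vintage` and the last two titles are invented for illustration; the first two titles come from the data preview at the top of the notebook. Using this pattern on the full dataset could change the per-year averages, so treat it as a cross-check rather than a replacement.

```python
import pandas as pd

def extract_vintage(titles):
    """Pull the first standalone 19xx/20xx number out of each title."""
    years = titles.str.extract(r"\b((?:19|20)\d{2})\b", expand=False)
    # Titles without a match become NaN.
    return pd.to_numeric(years, errors="coerce")

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",               # from the preview above
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",  # from the preview above
    "Hypothetical Cellars 20000 Leagues Red",         # invented: number is not a year
    "Hypothetical Winery NV Brut",                    # invented: no year at all
])

print(extract_vintage(titles).tolist())
```

With the looser `\d{4}` pattern, the third title would yield "2000" even though it carries no vintage.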


## Review Language and Scores

I looked at whether certain descriptive words in reviews were associated with higher or lower scores.

Reviews containing “depth” tend to receive higher than average points.

Reviews containing “fruity” tend to receive lower than average points.

Reviews containing “herbal” tend to receive lower than average points.

These results suggest that certain descriptors appear more often in higher- or lower-scoring reviews, although the presence of a word alone does not determine the score.

In [None]:
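# (Keyword average comparison function)

def keyword_average(keyword):
  # Overall mean across all reviews, used as the baseline.
  points_av = round(df['points'].mean(), 2)

  # Literal, case-sensitive substring match. A local mask avoids
  # adding a new column to df every time the function is called.
  has_key = df['description'].str.contains(keyword, regex=False, na=False)

  # Mean points for reviews that mention the keyword.
  with_key = round(df.loc[has_key, 'points'].mean(), 2)

  print('General point average:', points_av)
  print('Featuring', '"' + str(keyword) + '"', 'average:', with_key)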



In [None]:
# Question 5

keyword_average('depth')


General point average: 88.45
Featuring "depth" average: 90.11


In [None]:
# Question 6

keyword_average('fruity')


General point average: 88.45
Featuring "fruity" average: 87.61


In [None]:
# Question 7

keyword_average('herbal')


General point average: 88.45
Featuring "herbal" average: 87.47
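
The comparisons above set the keyword average against the overall average, which itself includes the keyword reviews. A slightly sharper comparison, sketched below, contrasts reviews with and without the word and reports how many reviews matched. The function name `keyword_comparison`, the case-insensitive matching, and the toy reviews are assumptions for illustration; because matching here ignores case, numbers on the real data may differ from the outputs above.

```python
import pandas as pd

def keyword_comparison(frame, keyword, text_col="description", value_col="points"):
    """Compare mean points for reviews with and without a literal keyword."""
    mask = frame[text_col].str.contains(keyword, case=False, regex=False, na=False)
    return {
        "n_with": int(mask.sum()),
        "mean_with": round(frame.loc[mask, value_col].mean(), 2),
        "mean_without": round(frame.loc[~mask, value_col].mean(), 2),
    }

# Made-up reviews for illustration only; not taken from wine.csv.
toy = pd.DataFrame({
    "description": [
        "Shows real depth and length",
        "Simple and fruity",
        "Depth of flavor throughout",
        "Light and herbal",
    ],
    "points": [92, 86, 90, 85],
})

print(keyword_comparison(toy, "depth"))
```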


## Review Length and Points

I created a character count for each review and calculated the correlation between review length and points. The coefficient of 0.56 indicates a moderately positive relationship between the number of characters in a review and the score.

This means that higher-scoring wines tend to receive longer, more detailed descriptions, although review length alone is far from a perfect predictor of the score.

In [None]:
# Question 8

df['char_count'] = df['description'].str.len()

corr_value = round(df['char_count'].corr(df['points']), 2)
print('Description character count & point value correlation coefficient is:', corr_value)



Description character count & point value correlation coefficient is: 0.56
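
One way to read a Pearson coefficient is through its square, which gives the share of variance in one variable that a straight-line fit on the other would account for. For r = 0.56 that is roughly 0.31. The sketch below wraps the same calculation as the cell above in a small function and returns both values; the function name `length_points_correlation` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def length_points_correlation(frame, text_col="description", value_col="points"):
    """Return Pearson r between text length and points, plus r squared."""
    lengths = frame[text_col].str.len()
    r = lengths.corr(frame[value_col])  # Pearson by default
    return round(r, 2), round(r ** 2, 2)

# Made-up rows where points rise in step with length; not taken from wine.csv.
toy = pd.DataFrame({
    "description": ["x" * 10, "x" * 20, "x" * 30, "x" * 40],
    "points": [85, 87, 89, 91],
})

r, r_squared = length_points_correlation(toy)
print(r, r_squared)

# Squaring the coefficient reported in the notebook.
print(round(0.56 ** 2, 2))
```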


## Sicily & Sardinia Regional Analysis

Using regular expressions to extract regions from the title, I focused on wines from Sicily & Sardinia. Among the regions listed, "Faro" has the highest average points. This suggests that wines from this region tend to perform best in this subset of the data, though sample sizes may vary.

In [None]:
# Question 9

df['region'] = df['title'].str.extract(r'\((.+)\)')

sicily_filter = df[df['province'] == 'Sicily & Sardinia']

region_av = sicily_filter.groupby('region')['points'].mean()

round(region_av.sort_values(ascending=False),2).head(1)

region,points
Faro,94.0
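
The pattern `\((.+)\)` is greedy, so a title containing more than one pair of parentheses would capture everything from the first opening parenthesis to the last closing one. The sketch below shows a hedged alternative that captures only the final parenthesized group at the end of the title. The function name `extract_region` and the third title are invented for illustration; the first two titles come from the data preview at the top of the notebook.

```python
import pandas as pd

def extract_region(titles):
    """Capture the text inside the last parenthesized group at the end of a title."""
    return titles.str.extract(r"\(([^()]+)\)\s*$", expand=False)

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",               # from the preview above
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",  # from the preview above
    "Hypothetical Estate (Riserva) 2012 Red (Faro)",  # invented: two parenthesized groups
])

print(extract_region(titles).tolist())
```

For the third title, the greedy pattern would return the whole span between the first "(" and the last ")", while the anchored pattern returns only the trailing group.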


## Conclusion

Overall, this analysis shows that wine scores vary with geography, reviewer, and descriptive language. Individual keywords shift the average by less than two points, and review length shows a moderate positive correlation (0.56) with points, but none of these acts as a strong predictor on its own. These results highlight the subjective nature of wine reviews and the importance of context when interpreting ratings.

## Limitations

This dataset represents a snapshot in time and may not reflect broader trends across all wines or reviewers. Additionally, correlations found in this analysis do not imply causation and should be interpreted cautiously.