# Analysis on Wine Reviews
## —130k wine reviews with variety, location, winery, price, and description
### -Created by Jinning Yan
### -Date: July 16th, 2023
<img src="https://static01.nyt.com/images/2023/02/08/multimedia/08pour-01-fmlw/08pour-01-fmlw-videoSixteenByNine3000.jpg" width="500" height="500">

### Warning Setting

In [None]:
import warnings
warnings.filterwarnings('ignore')

### Libraries & Dataset

In [None]:
# Basic libraries to manipulate data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import plotly.express as px
import plotly.graph_objs as go
import plotly.io as pio
from plotly.subplots import make_subplots

pio.renderers.default = "plotly_mimetype+notebook"

# Libraries for text mining
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
from collections import Counter
from wordcloud import WordCloud
from ast import literal_eval
from textblob import TextBlob #sentiment
from PIL import Image

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

# Libraries for predictive models
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

In [None]:
# Import file
wine = pd.read_csv('winemag-data-130k-v2.csv')

### Dataset Information

In [None]:
print(f"---Dataset Info---")
#printing column names
print(f"Total columns: {len(wine.columns)}")
print("Columns names:", end=" ")
for col in wine:
    if col == 'winery':
        print(col, end=".")
    else: 
        print(col, end=", ")
print()

print(f"Columns type:")
#creating temp array
columnData = []
wineIndexType = []
for col in wine.columns:
    temp = []
    wineIndexType.append(col)
    temp.append(wine[col].apply(type).unique())
    temp.append(wine[col].isnull().sum())
    columnData.append(temp)

wineColumnsType = pd.DataFrame(columnData, columns=['Types','NaN Count'])
wineColumnsType.index = wineIndexType
display(wineColumnsType)

print(f"Dataframe rows: {len(wine)}")

# display dataset
print("Dataset samples:")
wine.sample(5)

#### Column Names Explanation:
- __country__: Country of origin
- __description__: Sommelier's description on the wine
- __designation__: Vineyard where the grapes are from
- __points__: Rating on a scale of 1-100 (only scores >=80)
- __price__: Price of wine
- __province__: Province or state that the wine is from
- __region_1__: Wine growing area in a province or state (e.g. Napa)
- __region_2__: Specific region within a wine growing area (e.g. Rutherford inside the Napa Valley), sometimes blank
- __taster_name__: Name of the person who tasted and reviewed the wine
- __taster_twitter_handle__: Twitter handle for the person who tasted and reviewed the wine
- __title__: Title of the wine review
- __variety__: Type of grapes used to make the wine (e.g. Pinot Noir)
- __winery__: Winery that made the wine

### Analysis

#### Analysis by Countries and Continents

In [None]:
# Checking countries
country_counts = wine['country'].value_counts()
print(country_counts)

In [None]:
# Plot by Countries
plt.figure(figsize=(10,6))  
sns.barplot(x=country_counts.index, y=country_counts.values, alpha=0.8)

plt.title('Countries of Distribution')
plt.ylabel('Number of Occurrences', fontsize=12)
plt.xlabel('Country', fontsize=12)
plt.xticks(rotation=70) 
plt.show()

This Countries of Distribution diagram shows the number of different wines from each country in this dataset.
The United States has the most wine brands, almost as many as all other countries combined.
As a result, we can see that the wine market, especially in the United States, is already saturated, and opportunities are scarce.
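To put a number on the "almost as many as all other countries combined" reading, the share of each country can be computed directly. The following is a minimal sketch on a made-up list of country labels (the values are illustrative, not the dataset's); `country_shares` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample of the `country` column (illustrative values only)
countries = ["US", "US", "US", "France", "Italy", "US", "Spain", "France"]

def country_shares(values):
    """Return {country: share of rows}, ordered from largest to smallest share."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.most_common()}

shares = country_shares(countries)
for country, share in shares.items():
    print(f"{country}: {share:.1%}")
```

On the real DataFrame, `wine['country'].value_counts(normalize=True)` should give the same kind of table in one call.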

In [None]:
# Sorting by Continent
Europe = ['Austria', 'Bosnia and Herzegovina','Bulgaria','Croatia','Cyprus','Czech Republic','England', 'France','Germany','Greece','Italy','Luxembourg','Portugal','Hungary', 'Macedonia', 'Moldova', 'Romania', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Switzerland', 'Turkey', 'Ukraine', 'Georgia']
Asia = ['Armenia', 'China','India','Israel','Lebanon' ]
NorthAmerica = ['Canada','US','Mexico']
SouthAmerica = ['Argentina','Brazil','Chile','Peru','Uruguay']
Oceania = ['Australia','New Zealand'] 
Africa = ['South Africa','Morocco','Egypt']

def continentDispacher(row):
    if row['country'] in Europe:
        val = 'Europe'
    elif row['country'] in Asia:
        #val = 'Asia'
        val = 'Other'
    elif row['country'] in NorthAmerica:
        val = 'North America'
    elif row['country'] in SouthAmerica:
        #val = 'South America'
        val = 'Other'
    elif row['country'] in Oceania:
        #val = 'Oceania'
        val = 'Other'
    elif row['country'] in Africa:
        #val = 'Africa'
        val = 'Other'
    else:
        val = 'Other'

    return val

wine['continent'] = wine.apply(continentDispacher,1)
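The row-by-row `if`/`elif` chain can also be expressed as a single dictionary lookup. This is a sketch under assumptions: the country lists are shortened for illustration, and `CONTINENTS`, `COUNTRY_TO_CONTINENT` and `continent_of` are hypothetical names, not part of the notebook.

```python
# Per-continent lists, shortened here for illustration
CONTINENTS = {
    "Europe": ["France", "Italy", "Spain", "Portugal", "Germany"],
    "North America": ["US", "Canada", "Mexico"],
}

# Invert the lists once into a country -> continent lookup table
COUNTRY_TO_CONTINENT = {
    country: continent
    for continent, members in CONTINENTS.items()
    for country in members
}

def continent_of(country):
    """Return the continent label, or 'Other' for unlisted or missing countries."""
    return COUNTRY_TO_CONTINENT.get(country, "Other")

print(continent_of("US"))     # North America
print(continent_of("Chile"))  # Other
```

Applied column-wise, `wine['country'].map(continent_of)` would avoid calling a function on every full row.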

In [None]:
# Pie chart
pieContinent = px.pie(wine, names='continent', title='Wine Productions Across Continents')
pieContinent.update_traces(textposition='inside', textinfo='percent+label')
pieContinent.update(layout_showlegend=False)
pieContinent.show()

From this pie chart, we can observe that almost 90% of the 130k wines are from either North America or Europe.
Europe has slightly more brands than North America.

In [None]:
# World map
wineCountry = wine.groupby('country').count().reset_index()
wineCountry = wineCountry[['country','continent']]
wineCountry.columns = ['country','count']

fig = px.choropleth(wineCountry, locations="country", locationmode='country names', color="count", hover_name="country", color_continuous_scale=px.colors.sequential.Reds)
fig.update_geos(projection_type="natural earth")
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},title = 'Wine distribution across countries')
fig.show()

This world map shows the density of wine origins.
The US has far more wine brands than any other country on any continent.

In [None]:
# world wide wine distribution
wineRegion = wine.groupby(['continent','country','region_1'], dropna=False).count().reset_index()
wineRegion = wineRegion[['continent','country','region_1','points']]
wineRegion.columns = ['continent','country','region_1','count']
wineRegion = wineRegion.dropna(subset=['region_1'])

fig = px.treemap(wineRegion, path=["continent", 'country', 'region_1'],branchvalues="total", values='count', title='Wine distribution across countries')
fig.show()

This interactive diagram shows the regions where the different wines come from.

We can observe that Napa Valley is the source of the most brands in the US.
As a result, despite the constant marketing of Napa Valley wine brands in China, we should think twice: a Napa Valley label does not necessarily indicate high quality, given the mass production.

On the other hand, most wines from Europe come from France and Italy.

#### Vintage (Production Year) Analysis

In [None]:
# Extracting the year from title
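# Raw string avoids the invalid-escape warning for \d; expand=False returns a Series
wine['vintage'] = wine['title'].str.extract(r'(\d{4})', expand=False)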
wine.head()

In [None]:
wineVintageWithoutNaN = wine.copy()
wineVintageWithoutNaN['vintage'] = pd.to_numeric(wineVintageWithoutNaN['vintage'], errors='coerce')
# Keep plausible vintages only: 1950 up to (but excluding) the current year
current_year = datetime.datetime.now().year
wineVintageWithoutNaN = wineVintageWithoutNaN[wineVintageWithoutNaN['vintage'].between(1950, current_year - 1)]
vintageDistribution = px.histogram(wineVintageWithoutNaN, x="vintage", title='Vintage review distribution')

vintageDistribution.update_xaxes(title='Year',dtick=1)
vintageDistribution.update_yaxes(title='Count')
vintageDistribution.show()

This diagram shows the production year of the 130,000+ reviewed wines. Most of the wines were produced between 2010 and 2014, with 2013 being the most common vintage.
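Taking the first four-digit run in a title can pick up a number that is not the vintage, for example a founding year that is part of a wine's name. A slightly stricter sketch is shown below: it skips digit runs longer than four and returns the first year inside a plausible range. The titles are invented for illustration, and `extract_vintage` with its `min_year`/`max_year` parameters is a hypothetical helper.

```python
import re
from datetime import date

# A four-digit number that is not part of a longer digit run
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")

def extract_vintage(title, min_year=1950, max_year=None):
    """Return the first plausible vintage year found in a title, or None."""
    if max_year is None:
        max_year = date.today().year
    for match in YEAR_RE.finditer(title):
        year = int(match.group(1))
        if min_year <= year <= max_year:
            return year
    return None

# Hypothetical titles (illustrative only)
print(extract_vintage("Example Estate 1882 Reserve 2012 Red (Napa Valley)"))  # 2012
print(extract_vintage("Example Cellars NV Brut (Champagne)"))                 # None
```

The range check plays the same role as the 1950 filter applied before plotting, but it runs before a candidate year is chosen rather than after.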

#### Wine Score Analysis

Scores (points) range from 0 to 100, but the dataset excludes wines that scored lower than 80:
- __80–84: Good__
- __85–89: Great__
- __90–94: Excellent__
- __95–99: Outstanding__
- __100: Impeccable__

In [None]:
# assign point
def assign_point_description(point):
    if point <= 84:
        return 'Good'
    elif point <= 89:
        return 'Great'
    elif point <= 94:
        return 'Excellent'
    elif point <= 99:
        return 'Outstanding'
    else:
        return 'Impeccable'

wine['pointsDescription'] = wine['points'].apply(assign_point_description)

In [None]:
#Histogram of points
pointDistribution = px.histogram(wine, x='points', color='pointsDescription', title='Points distribution', height=500,
 category_orders=dict(pointsDescription=['Impeccable', 'Outstanding', 'Excellent', 'Great', 'Good']), 
                  labels={
                     "pointsDescription": "Point Description"
                 },
                 color_discrete_map = {'Impeccable':'#57e32c','Outstanding':'#b7dd29','Excellent':'#ffe234','Great':'#ffa534', 'Good':'#ff4545'}

)

# update axis
pointDistribution.update_xaxes(title='Point',tickmode='linear')
pointDistribution.update_yaxes(title='Count')
#display histogram
pointDistribution.show()

We can observe that most of the wines in this list scored 87, 88, or 90. Wines graded 'Good' outnumber those graded 'Outstanding' and 'Impeccable' combined.
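The comparison between grades can be checked by counting labels rather than reading bar heights. This is a minimal sketch on a short made-up list of scores, not the dataset's `points` column; `grade` is a hypothetical helper that mirrors the thresholds listed above.

```python
from collections import Counter

def grade(points):
    """Map a score to its grade label using the 84/89/94/99 thresholds."""
    if points <= 84:
        return "Good"
    if points <= 89:
        return "Great"
    if points <= 94:
        return "Excellent"
    if points <= 99:
        return "Outstanding"
    return "Impeccable"

# Hypothetical scores (illustrative only)
scores = [82, 84, 83, 87, 88, 90, 91, 95, 100]
counts = Counter(grade(p) for p in scores)

good = counts["Good"]
top = counts["Outstanding"] + counts["Impeccable"]
print(f"Good: {good}, Outstanding + Impeccable: {top}")  # Good: 3, Outstanding + Impeccable: 2
```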

#### Wine Price Analysis

Divide the price of wine into 5 categories:
- __<=10 usd: Adequate__
- __11–50 usd: Casual__
- __51–100 usd: Premium__
- __101–200 usd: Luxury__
- __>=201 usd: Exemplary__

In [None]:
# Define price ranges
AdequateOffset = 10
CasualOffset = 50
PremiumOffset = 100
LuxuryOffset = 200

# Assign price
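def priceDispacher(price):
    # NaN fails every comparison below, so handle missing prices first;
    # otherwise they would fall through to 'Exemplary'
    if pd.isna(price):
        return np.nan
    if price <= AdequateOffset:
        return 'Adequate'
    elif price <= CasualOffset:
        return 'Casual'
    elif price <= PremiumOffset:
        return 'Premium'
    elif price <= LuxuryOffset:
        return 'Luxury'
    else:
        return 'Exemplary'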

wine['Description'] = wine['price'].map(priceDispacher)
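Because every comparison with NaN evaluates to False, a bucketing function built from `<=` tests needs an explicit check for missing prices. A compact alternative sketch using `bisect` is shown below. `UPPER_BOUNDS`, `LABELS` and `price_bucket` are hypothetical names, and returning `None` for a missing price is an assumption about the desired behavior.

```python
import math
from bisect import bisect_left

UPPER_BOUNDS = [10, 50, 100, 200]  # inclusive upper edges in USD
LABELS = ["Adequate", "Casual", "Premium", "Luxury", "Exemplary"]

def price_bucket(price):
    """Return the price label, or None when the price is missing."""
    if price is None or math.isnan(price):
        return None
    # bisect_left finds the first bound that is >= price, so edges are inclusive
    return LABELS[bisect_left(UPPER_BOUNDS, price)]

print(price_bucket(10))            # Adequate
print(price_bucket(10.5))          # Casual
print(price_bucket(250))           # Exemplary
print(price_bucket(float("nan")))  # None
```

Keeping the bounds and labels in two parallel lists means a category can be added or moved without editing a chain of branches.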

In [None]:
# Box plot
boxPricePoint = go.Figure()
boxPricePoint.add_trace(go.Box(x=wine['points'], y=wine['price'], orientation='v',marker_color='#722F37', boxmean=True))
boxPricePoint.update_layout(xaxis_range=[79.5, 100.5], title='Correlation between Price and Score of Wine')
boxPricePoint.update_xaxes(title='Point of Wine', dtick=1)
boxPricePoint.update_yaxes(title='Price of Wine (USD)',type="log")
boxPricePoint.update_yaxes()

# display box plot
boxPricePoint.show()

From this diagram, we can observe that although a more expensive wine does not guarantee a higher score, the trend shows that a more expensive wine is more likely to receive a higher score. More study is required to understand whether this is due to the high price point or simply the taste.
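The trend read from the box plot can be summarized with a rank correlation, which is less sensitive to the long tail of prices than a plain linear correlation. This is a sketch on invented prices and scores, not the dataset; it uses the fact that Spearman's rho equals the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical prices and scores (illustrative values only)
sample = pd.DataFrame({
    "price":  [12, 18, 25, 40, 65, 120, 300, None],
    "points": [84, 86, 85, 88, 90, 92, 96, 89],
}).dropna(subset=["price"])  # rows without a price cannot be ranked

# Spearman's rho computed as the Pearson correlation of the two rank series
rho = sample["price"].rank().corr(sample["points"].rank())
print(f"Spearman rho: {rho:.3f}")
```

A value near 1 would support "pricier wines tend to score higher", but it says nothing about why, which is the open question noted above.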

In [None]:
# Stacked histogram
averagepricePoint = px.histogram(wine, x='points', color='Description', barmode='stack', barnorm='percent',
                 # the key must match the column passed to `color` for the order to apply
                 category_orders=dict(Description=['Adequate', 'Casual', 'Premium', 'Luxury', 'Exemplary']),
                 title='Price distribution by Score',
                 labels={"Description": "Price Description"},
                 color_discrete_sequence=px.colors.sequential.Burg
                 )
# update axis
averagepricePoint.update_xaxes(title='Point', dtick=1)
averagepricePoint.update_yaxes(title='Count %')

#display stacked histogram
averagepricePoint.show()

This diagram shows, for each score, the percentage of wines that fall in each price category.

We can observe that 78.9% of the wines rated 100 cost more than $200.

Wines in the Casual price range most likely receive between 86 and 90 points, which is not bad considering the price.

Luxury wines are more likely to receive scores between 96 and 98, which suits the price.

And lastly, cheaper Adequate wines rarely score higher than 87.
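With `barnorm='percent'` each bar is normalized within its score, so the percentages answer "what share of wines at this score fall in each price bucket", not the reverse. The sketch below contrasts the two readings with `pd.crosstab` on invented rows; the counts are illustrative only.

```python
import pandas as pd

# Hypothetical (score, price bucket) pairs, illustrative only
sample = pd.DataFrame({
    "points": [100, 100, 100, 100, 95, 95, 95, 95, 95, 95],
    "Description": ["Exemplary", "Exemplary", "Exemplary", "Luxury",
                    "Exemplary", "Exemplary", "Exemplary", "Luxury", "Luxury", "Premium"],
})

# Share of each price bucket within a score (the stacked-percent reading)
within_score = pd.crosstab(sample["points"], sample["Description"], normalize="index")
# Share of each score within a price bucket (the reverse question)
within_bucket = pd.crosstab(sample["points"], sample["Description"], normalize="columns")

print(within_score.loc[100, "Exemplary"])   # 0.75: 3 of the 4 wines rated 100
print(within_bucket.loc[100, "Exemplary"])  # 0.5: 3 of the 6 Exemplary wines
```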

#### Text Analysis

In [None]:
print(wine.columns.values)

In [None]:
wine.head()

In [None]:
wine.describe()

In [None]:
wine['taster_name'].value_counts()

In [None]:
# Amount of reviews of each wine taster
plt.figure(figsize=(10,15))
sns.countplot(y='taster_name', data=wine, order=wine.taster_name.value_counts().index)
plt.show()

This diagram shows the number of reviews each sommelier wrote. Roger Voss has tasted and reviewed more wines than any other sommelier.

In [None]:
wine['numwords'] = wine['description'].map(lambda x:len(re.findall(r'\w+', x)))

In [None]:
wordsbychar = wine.groupby('taster_name', as_index=False).numwords.sum()
wordsbychar

In [None]:
plt.figure(figsize=(10,8))
sns.barplot(x='numwords', y='taster_name', data=wordsbychar, order=wordsbychar.sort_values('numwords').taster_name[0:20], orient='h')
plt.show()
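Total word counts are driven largely by how many reviews each taster wrote, so the average number of words per review is a useful companion figure. This is a minimal sketch on invented reviews, using the same `\w+` word pattern as above; the taster names, texts and the `words_per_taster` helper are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical (taster, description) pairs, illustrative only
reviews = [
    ("Taster A", "Ripe fruit and firm tannins."),
    ("Taster A", "Crisp acidity."),
    ("Taster A", "Soft finish."),
    ("Taster B", "Dense aromas of blackberry and cedar."),
]

def words_per_taster(rows):
    """Return total words, review count and mean words per review for each taster."""
    totals = defaultdict(lambda: [0, 0])  # taster -> [word count, review count]
    for taster, text in rows:
        totals[taster][0] += len(re.findall(r"\w+", text))
        totals[taster][1] += 1
    return {t: {"total": w, "reviews": n, "mean": w / n} for t, (w, n) in totals.items()}

stats = words_per_taster(reviews)
for taster, s in stats.items():
    print(taster, s)
```

In this toy data Taster A writes more words in total while Taster B writes longer reviews, which the summed chart alone would not show.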

In [None]:
# Lowercase conversion, HTML tag removal, URL removal, digit removal,
# tokenization, and stopword removal. Stemming and lemmatization are not
# applied, so the returned tokens keep their original word forms.
STOP_WORDS = set(stopwords.words('english'))  # build once instead of once per token
TAG_RE = re.compile(r'<.*?>')

def preprocess(sentence):
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    cleantext = TAG_RE.sub('', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub(r'[0-9]+', '', rem_url)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    # keep tokens longer than two characters that are not stopwords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]
    return " ".join(filtered_words)

In [None]:
import nltk
nltk.download('omw-1.4')

In [None]:
# adding this processed function to our text in a new column
wine['clean'] = wine['description'].map(lambda x: preprocess(x))

In [None]:
# Top 10 words
# Join with a space so the last word of one review is not fused with the first word of the next
topwords = Counter(" ".join(wine['clean']).split()).most_common(10)
topwords

In [None]:
# Word cloud (all sommeliers)
sommeliers_wordcloud = WordCloud(background_color='white', max_words=100, colormap='copper').generate_from_frequencies(dict(topwords))
plt.figure(figsize=(10, 10))
plt.imshow(sommeliers_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

These are the 10 most commonly used words across all the reviews.
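When counting words across many reviews, the separator used to combine them matters: joining on an empty string fuses the last word of one review with the first word of the next. The sketch below shows the effect on three invented cleaned reviews and an alternative that updates a `Counter` one review at a time.

```python
from collections import Counter

# Hypothetical cleaned reviews (illustrative only)
cleaned = ["ripe fruit finish", "fruit acidity", "wine fruit"]

# Joining on "" glues neighbouring reviews together at the boundary
glued = Counter("".join(cleaned).split())

# Updating per review never builds one large string and keeps word boundaries
per_review = Counter()
for text in cleaned:
    per_review.update(text.split())

print(glued["fruit"], per_review["fruit"])  # 2 3
print(sorted(w for w in glued if w not in per_review))  # ['aciditywine', 'finishfruit']
```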

In [None]:
wine['taster_name'].unique()

In [None]:
voss = wine[wine['taster_name']=='Roger Voss']

In [None]:
count_voss = Counter(" ".join(voss["clean"]).split()).most_common(10)
count_voss

In [None]:
# word cloud (Roger Voss)
voss_wordcloud = WordCloud(background_color='white', max_words=100, colormap='copper').generate_from_frequencies(dict(count_voss))
plt.figure(figsize=(10, 10))
plt.imshow(voss_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

This word cloud shows the words Roger Voss used most often in his reviews. We can see that he has experience with fruity wines and focuses on the acidity of the wines.

In [None]:
schachner = wine[wine['taster_name']=='Michael Schachner']

In [None]:
count_schachner = Counter(" ".join(schachner["clean"]).split()).most_common(10)
count_schachner

In [None]:
# word cloud (Michael Schachner)
schachner_wordcloud = WordCloud(background_color='white', max_words=100, colormap='copper').generate_from_frequencies(dict(count_schachner))
plt.figure(figsize=(10, 10))
plt.imshow(schachner_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Michael Schachner focuses his reviews on the flavors of the wine: he talks about the aromas and how the wine finishes.

### Sentiment Analysis

In [None]:
# Create a function to get the polarity
def get_polarity(text):
    return TextBlob(text).sentiment.polarity

# Create a new column 'Polarity' by applying the function to the 'description' column
wine['Polarity'] = wine['description'].apply(get_polarity)

In [None]:
wine.head()

In [None]:
fig = px.histogram(wine, x="Polarity")
fig.show()

In [None]:
fig = px.histogram(wine, x="Polarity", nbins=50, color_discrete_sequence=['#722F37'])
fig.show()

This histogram shows the sentiment polarity of the sommeliers' reviews. Most of the reviews are positive (polarity > 0), but there are still some negative reviews.

In [None]:
fig = px.scatter(wine, x="points", y="Polarity", color_discrete_sequence=['#722F37'])
fig.show()

From this diagram we can observe that the lower the score a wine receives, the more neutral the polarity of its review. The higher the score, the more positive the review.
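A scatter plot with this many overlapping points can hide the trend, so averaging polarity per score is one way to make the relationship easier to read. This is a sketch on invented polarity values, not TextBlob output from the dataset.

```python
import pandas as pd

# Hypothetical scores and polarity values (illustrative only)
sample = pd.DataFrame({
    "points":   [82, 82, 88, 88, 94, 94],
    "Polarity": [0.05, 0.10, 0.15, 0.20, 0.30, 0.35],
})

# Mean polarity for each score, indexed by score
mean_polarity = sample.groupby("points")["Polarity"].mean()
print(mean_polarity)
```

Plotting `mean_polarity` as a line over the scatter would show whether the average rises with the score.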

In [None]:
# scatter plot of sentiment polarity vs price
fig = px.scatter(wine, x="price", y="Polarity", log_x=True, hover_data=['title'], color_discrete_sequence=['#722F37'])
fig.update_layout(title='Sentiment Polarity vs Wine Price', 
                  xaxis=dict(title='Price (log scale)'), 
                  yaxis=dict(title='Sentiment Polarity'))
fig.show()

We can observe that there is no clear correlation between price and polarity, but very expensive wines are less likely to have a low sentiment score.

### Conclusion

- __Geographical Distribution:__ The analysis reveals that the majority of wines come from the United States and Europe, with the U.S. holding the largest share. This suggests that the market, particularly in the U.S., is highly saturated, indicating a potentially competitive environment for new brands or types of wine. It also points to opportunities in under-represented regions such as South America, Asia, and Africa.

- __Vintage:__ Most of the wines included in the dataset were produced between 2010 and 2014, indicating a relatively young age for the majority of wines being rated. Wines from these years are likely still readily available in the market.

- __Ratings:__ Ratings are generally positive, with many wines scoring around 87 to 90 points. The analysis shows that ratings are somewhat linked to price, with more expensive wines being more likely to receive higher scores. However, there are many instances where lower-priced wines receive high scores, suggesting that price is not always an indicator of quality or enjoyment.

- __Sentiment Analysis:__ The sentiment analysis of the reviews indicates a generally positive sentiment among reviewers, even for wines that receive lower scores. Interestingly, very expensive wines are less likely to have a low sentiment score, although price and polarity show no clear correlation across the full price range.

- __Review Content:__ Reviewers tend to focus on the flavor, aroma, and acidity of the wines. Words related to these attributes feature prominently in the reviews, indicating the importance of these aspects in the wine tasting and rating process.

In conclusion, the analysis suggests that while the wine market is heavily dominated by the U.S. and Europe, there are opportunities for wines from other regions. Furthermore, while price can be an indicator of quality, it's not a guarantee, and a well-made, lower-priced wine can still receive high scores and positive reviews. Reviewers focus on the flavor, aroma, and acidity of wines, suggesting these are important factors for winemakers to consider.