# Wine Classifications by Reviews

There are so many wines out there today that a trip to the grocery store to pick up a quick drink to go with dinner can often take longer than expected. Even when you know the type of wine you like, there are so many different wineries, and let's not even start on the years. Wanting to try something new can be daunting too: what if you spend $15 just to find out you can't stand more than one sip?!

Luckily, there are sommeliers in the world... unluckily, they probably aren't hanging around your local grocery or liquor store. Classification beyond the wine types could prove valuable in this case! Using machine learning and a hefty data source of sommeliers' reviews, I plan to use **LSA** (latent semantic analysis) text classification to find groups of wine, possibly with more in common than just being the same type!

## Text Preparation

Before we can get started with topic creation, we need to clean the text we will be using. This process has been adapted from Joshi (2018).

In [4]:
import pandas as pd

In [5]:
# Raw string so the backslashes in the Windows path are never read as escape sequences.
wine = pd.read_csv(r"D:\Practicum2\data\Wine.csv")

In [7]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude


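We start by removing punctuation and numbers, as we are only after the words here. Note that the pattern keeps only unaccented ASCII letters (and `#`), so accented characters are replaced with spaces as well, which is why "Bégude" shows up as "B gude" in the cleaned column.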

In [8]:
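# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default.
wine['clean_desc'] = wine['description'].str.replace("[^a-zA-Z#]", " ", regex=True)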

In [9]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery,clean_desc
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,This tremendous varietal wine hails from ...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,Ripe aromas of fig blackberry and cassis are ...
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,Mac Watson honors the memory of a wine once ma...
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,This spent months in new French oak an...
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude,This is the top wine from La B gude named aft...
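
To make the effect of that pattern on accented letters concrete, here is a minimal sketch in plain Python. The helper names `strip_accents` and `clean_text` are my own for illustration and are not part of the notebook; folding accents first is an optional alternative, not what the cells above do.

```python
import re
import unicodedata

def strip_accents(text: str) -> str:
    """Fold accented letters to their base letter (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_text(text: str, fold_accents: bool = False) -> str:
    """Replace every character outside a-z, A-Z and # with a space."""
    if fold_accents:
        text = strip_accents(text)
    return re.sub(r"[^a-zA-Z#]", " ", text)

sample = "La Bégude, 100% varietal"
print(clean_text(sample).split())                     # the accent splits the word
print(clean_text(sample, fold_accents=True).split())  # the word stays whole
```

If keeping names such as "Bégude" intact matters for the topics, something like `wine['description'].apply(clean_text, fold_accents=True)` could stand in for the `str.replace` call, though I have not run that variant on the full data set.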


We can head off some accidental classifications based on words that aren't so meaningful by removing words of three characters or fewer as well (the filter keeps only words longer than three characters, so "fig", "and" and "the" all disappear).

In [10]:
wine['clean_desc'] = wine['clean_desc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))

In [11]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery,clean_desc
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,This tremendous varietal wine hails from Oakvi...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,Ripe aromas blackberry cassis softened sweeten...
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,Watson honors memory wine once made mother thi...
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,This spent months French incorporates fruit fr...
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude,This wine from gude named after highest point ...
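
The length filter is easy to check on a single review fragment. This is a small sketch in plain Python; `drop_short_words` and its `min_len` parameter are names I made up for illustration, chosen so that the default matches the `len(w)>3` test used above.

```python
def drop_short_words(text: str, min_len: int = 4) -> str:
    """Keep only whitespace-separated tokens with at least min_len characters."""
    return " ".join(w for w in text.split() if len(w) >= min_len)

print(drop_short_words("Ripe aromas of fig blackberry and cassis are"))
# Ripe aromas blackberry cassis
```

Four-letter words such as "Ripe" survive, while three-letter words such as "fig" do not, which is worth keeping in mind since some short tasting terms (for example "oak") are removed by the same rule.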


And we will make everything lowercase so that case doesn't affect the later vectorization (otherwise "Ripe" and "ripe" would be treated as different words when we want them treated as the same).

In [12]:
wine['clean_desc'] = wine['clean_desc'].str.lower()

In [13]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery,clean_desc
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,this tremendous varietal wine hails from oakvi...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aromas blackberry cassis softened sweeten...
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,watson honors memory wine once made mother thi...
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,this spent months french incorporates fruit fr...
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude,this wine from gude named after highest point ...


My first attempt at LSA showed me that the most common words in the topics were often the names of the wine varieties themselves, so we will remove those now: we want to use the review itself to group wines and find commonalities, not the wine types.

I created my own wine name word list from Wine Varietals A-Z (n.d.). Nothing fancy went into this beyond copy and paste!

In [14]:
ww=['albariño','aligoté','amarone','arneis','asti','auslese','banylus','barbaresco','bardolino','barolo','beaujolais','blanc','blanc','blush','boal','brunello','cabernet','cabernet','carignan','carmenere','cava','charbono','champagne','chardonnay','châteauneuf-du-pape','chenin','chianti','chianti','claret','colombard','constantia','cortese','dolcetto','eiswein','frascati','fumé','gamay','gamay','gattinara','gewürztraminer','grappa','grenache','johannisberg','kir','lambrusco','liebfraumilch','madeira','malbec','marc','marsala','marsanne','mead','meritage','merlot','montepulciano','moscato','mourvedre','müller-thurgau','muscat','nebbiolo','petit','petite','pinot','pinot','pinot','pinot','pinotage','port','retsina','rosé','roussanne','sangiovese','sauterns','sauvignon','sémillon','sherry','soave','tokay','traminer','trebbiano','ugni','valpolicella','verdicchio','viognier','zinfandel','spumante','franc','sauvignon','blanc','classico','beaujolais','riesling','verdot','sirah','grigio/pinot','meunier','noir','blancs','noirs','bual','colombard','gris','syrah']

In [15]:
ww

['albariño',
 'aligoté',
 'amarone',
 'arneis',
 'asti',
 'auslese',
 'banylus',
 'barbaresco',
 'bardolino',
 'barolo',
 'beaujolais',
 'blanc',
 'blanc',
 'blush',
 'boal',
 'brunello',
 'cabernet',
 'cabernet',
 'carignan',
 'carmenere',
 'cava',
 'charbono',
 'champagne',
 'chardonnay',
 'châteauneuf-du-pape',
 'chenin',
 'chianti',
 'chianti',
 'claret',
 'colombard',
 'constantia',
 'cortese',
 'dolcetto',
 'eiswein',
 'frascati',
 'fumé',
 'gamay',
 'gamay',
 'gattinara',
 'gewürztraminer',
 'grappa',
 'grenache',
 'johannisberg',
 'kir',
 'lambrusco',
 'liebfraumilch',
 'madeira',
 'malbec',
 'marc',
 'marsala',
 'marsanne',
 'mead',
 'meritage',
 'merlot',
 'montepulciano',
 'moscato',
 'mourvedre',
 'müller-thurgau',
 'muscat',
 'nebbiolo',
 'petit',
 'petite',
 'pinot',
 'pinot',
 'pinot',
 'pinot',
 'pinotage',
 'port',
 'retsina',
 'rosé',
 'roussanne',
 'sangiovese',
 'sauterns',
 'sauvignon',
 'sémillon',
 'sherry',
 'soave',
 'tokay',
 'traminer',
 'trebbiano',
 'ugni',

There are some other words that keep cropping up that don't add much to the topics either, so we will add those to this list.

In [16]:
ww.extend(['flavors','flavor','like','wine','character','finish','good','notes','drink','finish','bodied','body','blend'])

In [17]:
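#Code adapted from Foley (2015).
# A set makes each membership test constant time instead of scanning the whole list.
ww_set = set(ww)
wine['clean_desc'] = wine['clean_desc'].apply(lambda x: ' '.join(word for word in x.split() if word not in ww_set))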

In [18]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery,clean_desc
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,this tremendous varietal hails from oakville a...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aromas blackberry cassis softened sweeten...
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,watson honors memory once made mother this tre...
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,this spent months french incorporates fruit fr...
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude,this from gude named after highest point viney...
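
One thing to watch: the reviews have already had accents and hyphens replaced with spaces, so list entries such as 'gewürztraminer' or 'châteauneuf-du-pape' can never match a cleaned token. A possible fix, sketched here in plain Python, is to run the word list through the same cleaning steps before filtering. `normalize_terms` and `remove_terms` are hypothetical helpers of mine, not part of the adapted code.

```python
import re

def normalize_terms(terms):
    """Apply the review cleaning steps to a word list and return a set."""
    cleaned = set()
    for term in terms:
        # Same regex and lowercasing as the reviews; a term may split into pieces.
        pieces = re.sub(r"[^a-zA-Z#]", " ", term).lower().split()
        # Same length rule as the reviews: keep pieces longer than 3 characters.
        cleaned.update(p for p in pieces if len(p) > 3)
    return cleaned

def remove_terms(text, terms):
    """Drop every token of text that appears in the terms set."""
    return " ".join(w for w in text.split() if w not in terms)

stop = normalize_terms(["merlot", "gewürztraminer", "châteauneuf-du-pape"])
print(sorted(stop))
print(remove_terms("soft merlot with rztraminer spice", stop))
```

The fragments this produces (such as "rztraminer") look odd, but they are exactly what the cleaned reviews contain, so they are what the filter needs to see.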


In [19]:
wine.to_csv(r'D:\Practicum2\data\WineClean.csv', index = False)