# Wine Classifications by Reviews

There are so many wines out there today that a trip to the grocery store to pick up a quick drink to go with dinner can often take longer than expected. Even when you know the type of wine you like, there are so many different wineries, and let's not even start on the years. Trying something new can be daunting too: what if you spend $15 just to find out you can't stand more than one sip?!

Luckily, there are sommeliers in the world... unluckily, they probably aren't hanging around your local grocery or liquor store. Classification beyond the wine types could prove valuable in this case! Using machine learning and a hefty data source of sommeliers' reviews, I plan to use **LSA** (latent semantic analysis) on the review text to find groups of wine, possibly with more in common than just being the same type!

Time to start vectorizing!

In [1]:
import pandas as pd

In [3]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
wine = pd.read_csv(r"D:\Practicum2\data\WineClean.csv")

In [5]:
wine.head()

Unnamed: 0,id,country,description,designation,points,price,province,region_1,region_2,variety,winery,clean_desc
0,0,United States of America,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,this tremendous varietal hails from oakville a...
1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aromas blackberry cassis softened sweeten...
2,2,United States of America,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,watson honors memory once made mother this tre...
3,3,United States of America,"This spent 20 months in 30% new French oak, an...",Reserve,96,65,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,this spent months french incorporates fruit fr...
4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66,Provence,Bandol,,Provence red blend,Domaine de la Bégude,this from gude named after highest point viney...


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=1000,  # keep only the top 1000 terms; the full vocabulary is much larger
    max_df=0.5,         # ignore terms that appear in more than half of the descriptions
    smooth_idf=True)

In [7]:
X = vectorizer.fit_transform(wine['clean_desc'])

In [8]:
X.shape # check shape of the document-term matrix

(97850, 1000)

In [9]:
from sklearn.decomposition import TruncatedSVD

How many groups do we want? There are a lot of different varieties, so let's count them.

In [10]:
wine.agg({'variety':pd.Series.nunique})

variety    632
dtype: int64

That's not super helpful. Let's just try a few different values and see what happens.

In [11]:
svd_model10 = TruncatedSVD(n_components=10, algorithm='randomized', n_iter=100, random_state=122)

In [12]:
svd_model25 = TruncatedSVD(n_components=25, algorithm='randomized', n_iter=100, random_state=122)

In [13]:
svd_model50 = TruncatedSVD(n_components=50, algorithm='randomized', n_iter=100, random_state=122)

In [14]:
svd_model10.fit(X)

TruncatedSVD(algorithm='randomized', n_components=10, n_iter=100,
             random_state=122, tol=0.0)

In [15]:
svd_model25.fit(X)

TruncatedSVD(algorithm='randomized', n_components=25, n_iter=100,
             random_state=122, tol=0.0)

In [16]:
svd_model50.fit(X)

TruncatedSVD(algorithm='randomized', n_components=50, n_iter=100,
             random_state=122, tol=0.0)
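One possible way to compare different values of `n_components`, beyond eyeballing the topics, is to look at how much variance each fitted model explains. This is only a sketch on made-up descriptions, not something run on the wine data here; the component counts (2, 4, 6) are scaled down to fit the toy corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions, invented for illustration only (not from WineClean.csv).
toy_docs = [
    "ripe black cherry with firm tannins",
    "black fruit and ripe plum with soft tannins",
    "crisp green apple and citrus with fresh acidity",
    "fresh peach and white flowers with crisp acidity",
    "sweet vanilla and soft cherry",
    "light raspberry and tart berry fruit",
    "herbal nose with green plum palate",
    "rich wood aging with ripe fruits",
]
toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Fit one model per candidate size and record the total explained variance ratio.
cumulative = {}
for k in (2, 4, 6):
    svd = TruncatedSVD(n_components=k, algorithm="randomized",
                       n_iter=100, random_state=122)
    svd.fit(toy_X)
    cumulative[k] = float(svd.explained_variance_ratio_.sum())
    print(k, round(cumulative[k], 3))
```

On the real data the same attribute is available on the fitted models, for example `svd_model10.explained_variance_ratio_.sum()`.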

In [17]:
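# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
terms = vectorizer.get_feature_names_out()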

In [18]:
# Print the 7 highest-weighted terms for each component of the 10-topic model
for i, comp in enumerate(svd_model10.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: 
fruit
 
tannins
 
acidity
 
cherry
 
aromas
 
black
 
ripe
 
Topic 1: 
apple
 
citrus
 
crisp
 
fresh
 
white
 
peach
 
acidity
 
Topic 2: 
acidity
 
fruits
 
ripe
 
rich
 
wood
 
tannins
 
aging
 
Topic 3: 
sweet
 
soft
 
cherry
 
simple
 
cherries
 
vanilla
 
little
 
Topic 4: 
fresh
 
cherry
 
light
 
fruity
 
soft
 
acidity
 
bright
 
Topic 5: 
soft
 
sweet
 
ripe
 
berry
 
fruits
 
plum
 
wood
 
Topic 6: 
green
 
herbal
 
palate
 
plum
 
sweet
 
apple
 
nose
 
Topic 7: 
fruit
 
light
 
herbal
 
tart
 
berry
 
raspberry
 
barrel
 
Topic 8: 
soft
 
fruit
 
black
 
apple
 
light
 
tannins
 
ripe
 
Topic 9: 
sweet
 
black
 
cherry
 
acidity
 
tannins
 
firm
 
white
 


In [19]:
# Print the 7 highest-weighted terms for each component of the 25-topic model
for i, comp in enumerate(svd_model25.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: 
fruit
 
tannins
 
acidity
 
cherry
 
aromas
 
black
 
ripe
 
Topic 1: 
apple
 
citrus
 
crisp
 
fresh
 
white
 
peach
 
acidity
 
Topic 2: 
acidity
 
fruits
 
ripe
 
rich
 
wood
 
tannins
 
aging
 
Topic 3: 
sweet
 
soft
 
cherry
 
simple
 
cherries
 
vanilla
 
little
 
Topic 4: 
fresh
 
cherry
 
light
 
fruity
 
soft
 
acidity
 
bright
 
Topic 5: 
soft
 
sweet
 
ripe
 
berry
 
fruits
 
plum
 
wood
 
Topic 6: 
green
 
herbal
 
palate
 
plum
 
sweet
 
apple
 
nose
 
Topic 7: 
fruit
 
light
 
herbal
 
tart
 
berry
 
raspberry
 
barrel
 
Topic 8: 
soft
 
fruit
 
black
 
apple
 
light
 
tannins
 
ripe
 
Topic 9: 
sweet
 
black
 
cherry
 
acidity
 
tannins
 
firm
 
white
 
Topic 10: 
cherry
 
ripe
 
raspberry
 
vanilla
 
acidity
 
plum
 
palate
 
Topic 11: 
blackberry
 
green
 
chocolate
 
dark
 
apple
 
cherry
 
citrus
 
Topic 12: 
tannins
 
firm
 
apple
 
cherry
 
green
 
soft
 
years
 
Topic 13: 
ripe
 
white
 
blackberry
 
berry
 
acidity
 
fruit
 
tannic
 
Topic 14: 
tannins
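The same printing loop gets repeated for every model, and one term per line makes for a lot of scrolling. A small helper could tidy that up. The names `top_terms` and `print_topics` are hypothetical, and the weights below are made up purely to show the output format.

```python
def top_terms(terms, component, n=7):
    """Return the n terms with the largest weights in one SVD component."""
    ranked = sorted(zip(terms, component), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


def print_topics(model_components, terms, n=7):
    """Print one line per topic instead of one line per term."""
    for i, component in enumerate(model_components):
        print(f"Topic {i}: " + ", ".join(top_terms(terms, component, n)))


# Made-up terms and weights, for illustration only.
demo_terms = ["fruit", "tannins", "apple", "citrus"]
demo_components = [[0.6, 0.5, 0.1, 0.1], [-0.2, -0.3, 0.7, 0.6]]
print_topics(demo_components, demo_terms, n=2)
# Topic 0: fruit, tannins
# Topic 1: apple, citrus
```

With the fitted models this would be called as, say, `print_topics(svd_model25.components_, terms)`.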


In [20]:
# Print the 7 highest-weighted terms for each component of the 50-topic model
for i, comp in enumerate(svd_model50.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Topic " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: 
fruit
 
tannins
 
acidity
 
cherry
 
aromas
 
black
 
ripe
 
Topic 1: 
apple
 
citrus
 
crisp
 
fresh
 
white
 
peach
 
acidity
 
Topic 2: 
acidity
 
fruits
 
ripe
 
rich
 
wood
 
tannins
 
aging
 
Topic 3: 
sweet
 
soft
 
cherry
 
simple
 
cherries
 
vanilla
 
little
 
Topic 4: 
fresh
 
cherry
 
light
 
fruity
 
soft
 
acidity
 
bright
 
Topic 5: 
soft
 
sweet
 
ripe
 
berry
 
fruits
 
plum
 
wood
 
Topic 6: 
green
 
herbal
 
palate
 
plum
 
sweet
 
apple
 
nose
 
Topic 7: 
fruit
 
light
 
herbal
 
tart
 
berry
 
raspberry
 
barrel
 
Topic 8: 
soft
 
fruit
 
black
 
apple
 
light
 
tannins
 
ripe
 
Topic 9: 
sweet
 
black
 
cherry
 
acidity
 
tannins
 
firm
 
white
 
Topic 10: 
cherry
 
ripe
 
raspberry
 
vanilla
 
acidity
 
plum
 
palate
 
Topic 11: 
blackberry
 
green
 
chocolate
 
dark
 
apple
 
cherry
 
citrus
 
Topic 12: 
tannins
 
firm
 
apple
 
cherry
 
green
 
soft
 
years
 
Topic 13: 
ripe
 
white
 
blackberry
 
berry
 
acidity
 
fruit
 
tannic
 
Topic 14: 
tannins
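The topics above describe terms, not wines. One possible next step toward actual groups of wine is to project each description into topic space and label it with its strongest component. This is a sketch under my own assumptions: the descriptions are made up, the three components are arbitrary, and "largest absolute weight" is just one simple heuristic.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy descriptions, invented for illustration only (not from WineClean.csv).
toy_docs = [
    "ripe black cherry with firm tannins",
    "black fruit and ripe plum with soft tannins",
    "crisp green apple and citrus with fresh acidity",
    "fresh peach and white flowers with crisp acidity",
    "sweet vanilla and soft cherry",
    "light raspberry and tart berry fruit",
]
toy_X = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

toy_svd = TruncatedSVD(n_components=3, algorithm="randomized",
                       n_iter=100, random_state=122)
doc_topic = toy_svd.fit_transform(toy_X)  # shape: (n_docs, n_components)

# Heuristic: label each description with the component of largest absolute weight.
toy_labels = np.abs(doc_topic).argmax(axis=1)
for doc, label in zip(toy_docs, toy_labels):
    print(label, doc)
```

On the real data the equivalent projection would be `svd_model10.transform(X)`, and the resulting labels could be added as a new column on `wine`.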


In [None]:
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
#https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py