# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_:

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the numeric ones to actual numbers, and extract the others from the strings that contain them.
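
As a rough sketch of what that parsing can look like on single values, here are a few helper functions run against strings copied from the sample rows. The function names are hypothetical, and they assume every price looks like `$25, Buy Now` and every appellation ends with the country.

```python
import re

def parse_price(raw):
    """Return the first run of digits as a float, or None (e.g. for 'N/A, Buy Now')."""
    match = re.search(r"\d+", raw)
    return float(match.group()) if match else None

def parse_alcohol(raw):
    """Turn a string like '14.5%' into the number 14.5."""
    return float(raw.replace("%", ""))

def parse_year(wine_name):
    """Pull the first four-digit number out of the wine name, or None."""
    match = re.search(r"\d{4}", wine_name)
    return int(match.group()) if match else None

def parse_country(appellation):
    """The country is the last comma-separated piece of the appellation."""
    return appellation.split(", ")[-1]

print(parse_price("$25, Buy Now"))
print(parse_price("N/A, Buy Now"))
print(parse_alcohol("14.5%"))
print(parse_year("Artadi 2011 Viñas de Gain (Rioja)"))
print(parse_country("Rioja, Northern Spain, Spain"))
```

On the full dataset the same ideas are usually applied column-wide with the pandas `.str` methods rather than one value at a time.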

In [1]:
import pandas as pd
import numpy as np 
import requests
import re
from bs4 import BeautifulSoup
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/corina/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [2]:
df = pd.read_csv("wine-reviews.csv")



In [3]:
df

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]
5,https://www.winemag.com/buying-guide/mumm-napa...,90.0,Mumm Napa 2008 DVX Rosé Sparkling (Napa Valley),"Pretty peach in color, this 50-50 sparkling bl...",Virginie Boone,"$70, Buy Now",DVX Rosé,"Sparkling Blend, Sparkling","Napa Valley, Napa, California, US",Mumm Napa,12.5%,750 ml,Sparkling,,12/1/2014,Not rated yet [Add Your Review]
6,https://www.winemag.com/buying-guide/nuiton-be...,90.0,Nuiton-Beaunoy 2011 Clos du Chapitre Premier C...,The two-acre Clos du Chapitre vineyard is in t...,Roger Voss,"N/A, Buy Now",Clos du Chapitre Premier Cru,Pinot Noir,"Gevrey-Chambertin, Burgundy, France",Nuiton-Beaunoy,13%,750 ml,Red,"Fruit of the Vines, Inc",12/1/2014,Not rated yet [Add Your Review]
7,https://www.winemag.com/buying-guide/trapiche-...,90.0,Trapiche 2012 Broquel Cabernet Sauvignon (Mend...,"Spice, licorice and herbal notes complement re...",Michael Schachner,"$15, Buy Now",Broquel,Cabernet Sauvignon,"Mendoza, Mendoza Province, Argentina",Trapiche,14%,750 ml,Red,The Wine Group,12/1/2014,Not rated yet [Add Your Review]
8,https://www.winemag.com/buying-guide/zonin-201...,90.0,Zonin 2010 Amarone della Valpolicella,"Full-bodied and fresh, this offfers attractive...",Kerin O’Keefe,"$50, Buy Now",,"Red Blends, Red Blends","Amarone della Valpolicella, Veneto, Italy",Zonin,15%,750 ml,Red,Zonin USA,12/1/2014,Not rated yet [Add Your Review]
9,https://www.winemag.com/buying-guide/pali-2012...,90.0,Pali 2012 Cargasacchi Vineyard Pinot Noir (Sta...,"Round, savory aromas of orange-cranberry with ...",Matt Kettmann,"$56, Buy Now",Cargasacchi Vineyard,Pinot Noir,"Sta. Rita Hills, Central Coast, California, US",Pali,13.8%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]


In [4]:
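# Raw strings keep the regex escapes intact. Rows without a match stay NaN,
# so these columns come out as floats.
df['price'] = df.price.str.extract(r"(\d+)", expand=False).astype(float)
df['year_produced'] = df.wine_name.str.extract(r"(\d\d\d\d)", expand=False).astype(float)
df['alcohol'] = df.alcohol.str.replace("%", "").astype(float)
# The country is the last piece of the appellation
df['country'] = df.appellation.str.split(", ").str.get(-1)
# Exploratory look at user ratings (not displayed, since it isn't the cell's last expression)
df['user avg rating'].str.split(" ").str.get(0).replace("Not", np.nan).value_counts()
# Normalize bottle sizes to millilitres, e.g. "750 ml" -> 750 and "1.5 L" -> 1500
df['bottle size'] = df['bottle size'].str.replace(" ","") \
                                     .str.replace("ml", "", case=False) \
                                     .str.replace("l","000", case=False) \
                                     .str.replace(".50", "5", regex=False) \
                                     .astype(int)
df.head(2)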

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,year_produced,country
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,25.0,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5,750,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],2011.0,Spain
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,65.0,Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5,750,White,,12/1/2014,Not rated yet [Add Your Review],2012.0,US


In [5]:
df = df[df.alcohol < 25]

In [6]:
df = df[df.year_produced > 1960]

## Filtering out dirty data

Now that you've extracted some columns and converted others to numbers, you probably want to clean clean clean anything you're using in your analysis. There are some wines listed at 95% alcohol, which I don't think are exactly real, and you probably don't care about wines supposedly made in 1070 AD.
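
Here is a small sketch of that kind of filtering on a made-up table, so you can see how many rows each rule costs you. The toy rows are hypothetical; the thresholds are the ones this notebook uses (`alcohol < 25`, `year_produced > 1960`).

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "alcohol": [13.5, 95.0, 14.0, None],
    "year_produced": [2011, 2012, 1070, 2010],
})

# Combine both plausibility checks into one boolean mask
plausible = (toy.alcohol < 25) & (toy.year_produced > 1960)
cleaned = toy[plausible]

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that a comparison against a missing value is `False`, so rows with no alcohol or no year get dropped by these filters too. That may or may not be what you want.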

In [7]:
pd.set_option('max_colwidth', 100)

In [8]:
df.shape

(17597, 18)

In [10]:
df['wine_desc'].isna().value_counts()

False    17516
True        81
Name: wine_desc, dtype: int64

In [17]:
# Assigning Series.dropna() back to the column realigns on the index and restores the NaNs,
# so drop the rows from the DataFrame itself instead.
df = df.dropna(subset=['wine_desc'])

In [14]:
df['wine_desc'].isna().unique()

array([False])

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/)
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it

In [15]:
import seaborn as sns

sns.jointplot(data=df, x='year_produced', y='alcohol', kind="reg")
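
If you want a number to go with a year-versus-alcohol plot, one option is to fit a straight line and read off the slope. This is only a sketch on invented values, not a result from the wine data.

```python
import numpy as np

# Hypothetical values, made up for illustration only
years = np.array([1990, 1995, 2000, 2005, 2010])
alcohol = np.array([12.5, 13.0, 13.5, 14.0, 14.5])

# A degree-1 polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(years, alcohol, 1)
print(f"alcohol changes by about {slope:.2f} percentage points per year")
```

On the real data you would pass `df.year_produced` and `df.alcohol` instead, after dropping rows where either one is missing.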

## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - How does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc.
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, minority and poverty status on test scores 
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - Race and car speed for whether you get a warning vs. a ticket
    - Wet/dry pavement and car weight for whether you survive a car crash
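
To make the number-versus-category split concrete, here is a minimal sketch that fits both kinds of model to the same toy feature. All the values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature, six observations
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Predicting a number: the target is continuous
y_number = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
reg = LinearRegression().fit(X, y_number)
print("slope:", reg.coef_[0])  # how much the target moves per unit of X
print("prediction for 7:", reg.predict([[7.0]])[0])

# Predicting a category: the target is a label
y_label = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y_label)
print("class for 1.5:", clf.predict([[1.5]])[0])
print("class for 5.5:", clf.predict([[5.5]])[0])
print("coefficient:", clf.coef_[0][0])  # the sign shows the direction of influence
```

The same fitted models cover the "what influences the outcome" cases: instead of calling `predict`, you look at `coef_`.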

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.
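
A minimal sketch of that red-versus-white classifier might look like the following. The descriptions and labels are invented for illustration, and a vectorizer plus logistic regression is just one reasonable choice of model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical descriptions and labels, invented for illustration
descriptions = [
    "blackberry and plum with firm tannins",
    "dark cherry, tannins and oak spice",
    "ripe black plum and cherry, firm tannins",
    "crisp citrus and green apple",
    "fresh lemon, apple and bright citrus",
    "crisp pear and citrus zest",
]
labels = ["Red", "Red", "Red", "White", "White", "White"]

# The pipeline turns text into word weights, then fits the classifier on them
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(descriptions, labels)

predictions = model.predict(["firm tannins and dark plum", "crisp apple and citrus"])
print(predictions)
```

With the real data you would feed in `wine_desc` and `category` (restricted to the Red and White rows) and hold some reviews back to check how well it does on text it hasn't seen.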

In [None]:
# Which matters more for a wine's rating: category, volume, or variety?

In [None]:
# What are the most frequent words used to describe a wine? What is the sentiment behind each review?

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

In [None]:
#frequency

In [21]:
df['polarity'] = df['wine_desc'].apply(lambda row: TextBlob(row).sentiment.polarity)

In [22]:
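def polarity(row):
    # 1 = positive, 0 = negative. A perfectly neutral score (0.0) falls through,
    # returns None, and ends up as NaN in the column.
    if row > 0.0:
        return 1
    if row < 0.0:
        return 0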

In [23]:
df['polarity'] = df['polarity'].apply(polarity)

In [24]:
df.polarity.value_counts()

1.0    14468
0.0     2759
Name: polarity, dtype: int64

In [45]:
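# use_idf=False with norm='l1' turns each review into word proportions that sum to 1
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=False, norm='l1', max_features=2000)
X = tfidf_vectorizer.fit_transform(df['wine_desc'])
# get_feature_names() was removed in scikit-learn 1.2; reusing df.index keeps the rows aligned with df
words_df = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out(), index=df.index)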

In [46]:
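# A 'nan' token presumably comes from missing descriptions that were converted to text;
# errors='ignore' makes this a no-op if the column isn't there.
words_df = words_df.drop(columns=['nan'], errors='ignore')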

In [48]:
words_df.sum().sort_values(ascending=False).head(50)

wine          529.713536
flavors       448.953712
fruit         308.148446
palate        217.296487
aromas        214.867509
finish        213.341090
acidity       211.908636
cherry        182.893460
tannins       172.816973
drink         163.984798
ripe          159.846712
black         152.110750
red           121.632070
rich          121.465961
dry           119.300014
sweet         119.008394
spice         109.709528
oak           108.743559
nose          108.383943
notes         107.816336
fresh         100.234354
berry          97.421415
soft           92.907282
blackberry     84.290401
plum           82.962278
fruits         80.386476
crisp          77.254751
light          74.756241
white          74.514883
texture        74.322468
citrus         74.099934
apple          73.573615
good           72.601289
vanilla        72.369400
dark           72.030695
green          71.863107
shows          71.817356
blend          70.361134
bodied         68.665420
years          65.805480


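In [49]:
# Neutral reviews have NaN polarity, and assigning dropna() back to the column would not remove them.
# Keep only the labeled rows, using the same mask for X and y so they stay the same length.
labeled = df.polarity.notna()
X = words_df[labeled.values]
y = df.polarity[labeled]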

In [51]:
pd.set_option('display.max_columns', 50)

In [52]:
words_df.head()

Unnamed: 0,000,03,04,05,10,100,11,12,13,14,15,16,17,18,20,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,...,work,works,world,worth,worthy,wound,woven,wrapped,year,years,yeast,yeasty,yellow,yes,yielding,young,youth,youthful,zest,zestiness,zesty,zin,zinfandel,zingy,zippy
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
