# Machine learning on wine

**Topics:** Text analysis, linear regression, logistic regression, classification

**Datasets**

- **wine-reviews.csv** Wine reviews scraped from https://www.winemag.com/
- **Data dictionary:** just go [here](https://www.winemag.com/buying-guide/tenuta-dellornellaia-2007-masseto-merlot-toscana/) and look at the page

## The background

You work in the **worst newsroom in the world**, and you've had a hard few weeks at work - a couple stories killed, a few scoops stolen out from under you. It's not going well.

And because things just can't get any worse: your boss shows up, carrying a huge binder. She slams it down on your desk.

"You know some machine learning stuff, right?"

You say "no," but she isn't listening. She's giving you an assignment, the _worst assignment_...

> Machine learning is the new maps. Let's get some hits!
>
> **Do some machine learning on this stuff.**

"This stuff" is wine reviews.

## A tiny, meagre bit of help

You have a dataset. It has some stuff in it:

* **Numbers:**
    - Year published
    - Alcohol percentage
    - Price
    - Score
    - Bottle size
* **Categories:**
    - Red vs white
    - Different countries
    - Importer
    - Designation
    - Taster
    - Variety
    - Winery
* **Free text:**
    - Wine description

# Cleaning up your data

Many of these pieces - the alcohol, the year produced, the bottle size, the country the wine is from - aren't in a format you can use. Convert the ones that are really numbers into numbers, and extract the others from the appropriate strings.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)

In [2]:
df = pd.read_csv("wine-reviews.csv")
df.head(5)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,"$25, Buy Now",Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5%,750 ml,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review]
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,"$65, Buy Now",Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,"$25, Buy Now",Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5%,750 ml,White,,12/1/2014,Not rated yet [Add Your Review]
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,"$65, Buy Now",No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13%,750 ml,Red,,12/1/2014,Not rated yet [Add Your Review]
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,"$17, Buy Now",,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13%,750 ml,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review]


In [3]:
df.shape

(42295, 16)

In [4]:
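# cleaning price column
# regex=False so "$" is treated as a literal dollar sign, not the end-of-string anchor
df.price = df.price.str.replace("$", "", regex=False)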

In [5]:
df.price = df.price.str.replace("Buy Now", "")

In [6]:
# drop the leftover comma and any surrounding whitespace
df.price = df.price.str.replace(',', '', regex=False).str.strip()

In [7]:
# cleaning alcohol column
df.alcohol = df.alcohol.replace('%','', regex=True)

In [8]:
# cleaning bottle size column: strip the 'ml' unit (sizes in other units are left as-is)
df['bottle size'] = df['bottle size'].str.replace('ml', '', regex=False).str.strip()

In [9]:
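# extracting the country: it's the last comma-separated piece of the appellation
# .str.strip() removes the leading space left behind by the split
df['country'] = df.appellation.str.rsplit(',').str[-1].str.strip()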

In [10]:
# year the review was published: last four characters of 'date published'
df['year'] = df['date published'].str[-4:]

In [11]:
df['year'] = df['year'].astype(float)

In [12]:
# turn the publication year into "years since the review was published" (as of 2021)
df['year'] = 2021 - df['year']

In [49]:
df.isna().sum()

url                0
wine_points        0
wine_name          0
wine_desc          0
taster             0
price              0
designation        0
variety            0
appellation        0
winery             0
alcohol            0
bottle size        0
category           0
importer           0
date published     0
user avg rating    0
country            0
year               0
dtype: int64

In [14]:
df.head(5)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,country,year
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,25,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5,750,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],Spain,7.0
1,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2012 Stoller Vineyard Chardonnay (Du...,"A tiny production wine, this is rich, tart and...",Paul Gregutt,65,Stoller Vineyard,Chardonnay,"Dundee Hills, Willamette Valley, Oregon, US",Adelsheim,13.5,750,White,,12/1/2014,Not rated yet [Add Your Review],US,7.0
2,https://www.winemag.com/buying-guide/adelsheim...,90.0,Adelsheim 2013 Ribbon Springs Vineyard Auxerro...,This is another fine vintage for this rare win...,Paul Gregutt,25,Ribbon Springs Vineyard,"Auxerrois, Other White","Ribbon Ridge, Willamette Valley, Oregon, US",Adelsheim,13.5,750,White,,12/1/2014,Not rated yet [Add Your Review],US,7.0
3,https://www.winemag.com/buying-guide/jcb-2011-...,90.0,JCB 2011 No. 11 Pinot Noir (Sonoma Coast),Light in color and lilting floral aromas of ro...,Virginie Boone,65,No. 11,Pinot Noir,"Sonoma Coast, Sonoma, California, US",JCB,13.0,750,Red,,12/1/2014,Not rated yet [Add Your Review],US,7.0
4,https://www.winemag.com/buying-guide/pazo-pond...,90.0,Pazo Pondal 2013 Albariño (Rías Baixas),"Alluring, inviting aromas of white flowers, me...",Michael Schachner,17,,Albariño,"Rías Baixas, Galicia, Spain",Pazo Pondal,13.0,750,White,Vinaio Imports,12/1/2014,Not rated yet [Add Your Review],Spain,7.0


In [15]:

# drop every row that has a missing value in any column
df.dropna(how='any', inplace=True)

In [16]:
df.shape

(13318, 18)

In [17]:
df.head(10)

Unnamed: 0,url,wine_points,wine_name,wine_desc,taster,price,designation,variety,appellation,winery,alcohol,bottle size,category,importer,date published,user avg rating,country,year
0,https://www.winemag.com/buying-guide/artadi-20...,90.0,Artadi 2011 Viñas de Gain (Rioja),"Inky, minerally aromas of blackberry, black pl...",Michael Schachner,25.0,Viñas de Gain,Tempranillo,"Rioja, Northern Spain, Spain",Artadi,14.5,750,Red,Folio Fine Wine Partners,12/1/2014,Not rated yet [Add Your Review],Spain,7.0
6,https://www.winemag.com/buying-guide/nuiton-be...,90.0,Nuiton-Beaunoy 2011 Clos du Chapitre Premier C...,The two-acre Clos du Chapitre vineyard is in t...,Roger Voss,,Clos du Chapitre Premier Cru,Pinot Noir,"Gevrey-Chambertin, Burgundy, France",Nuiton-Beaunoy,13.0,750,Red,"Fruit of the Vines, Inc",12/1/2014,Not rated yet [Add Your Review],France,7.0
7,https://www.winemag.com/buying-guide/trapiche-...,90.0,Trapiche 2012 Broquel Cabernet Sauvignon (Mend...,"Spice, licorice and herbal notes complement re...",Michael Schachner,15.0,Broquel,Cabernet Sauvignon,"Mendoza, Mendoza Province, Argentina",Trapiche,14.0,750,Red,The Wine Group,12/1/2014,Not rated yet [Add Your Review],Argentina,7.0
10,https://www.winemag.com/buying-guide/peter-nic...,90.0,Peter Nicolay 2012 Erdener Treppchen Feinherb ...,"Honey-kissed peaches waft from this ripe, rich...",Anna Lee C. Iijima,20.0,Erdener Treppchen Feinherb,Riesling,"Mosel, Germany",Peter Nicolay,12.0,750,White,Saranty Imports,12/1/2014,Not rated yet [Add Your Review],Germany,7.0
11,https://www.winemag.com/buying-guide/pol-roger...,90.0,Pol Roger NV Réserve Brut (Champagne),"Full and ripe, this offers balance between ric...",Roger Voss,64.0,Réserve Brut,"Champagne Blend, Sparkling","Champagne, Champagne, France",Pol Roger,12.5,750,Sparkling,"Frederick Wildman & Sons, Ltd",12/1/2014,Not rated yet [Add Your Review],France,7.0
12,https://www.winemag.com/buying-guide/quinta-pa...,90.0,Quinta de Paços 2013 Casa do Capitão-mor Alvar...,Named after the 13th-century house at the cent...,Roger Voss,18.0,Casa do Capitão-mor,"Alvarinho, Albariño","Vinho Verde, Portugal",Quinta de Paços,13.0,750,White,Aidil Wines/Old World Import,12/1/2014,Not rated yet [Add Your Review],Portugal,7.0
13,https://www.winemag.com/buying-guide/quinta-no...,90.0,Quinta do Noval NV 10-Years-Old Tawny (Port),"This dry, balanced aged tawny is both fruity a...",Roger Voss,30.0,10-Years-Old Tawny,"Port, Port Blend","Port, Portugal",Quinta do Noval,20.0,750,Port/Sherry,Vintus LLC,12/1/2014,Not rated yet [Add Your Review],Portugal,7.0
15,https://www.winemag.com/buying-guide/maso-mart...,90.0,Maso Martis 2007 Brut Riserva Sparkling (Trento),"A blend of 70% Pinot Nero and 30% Chardonnay, ...",Kerin O’Keefe,52.0,Brut Riserva,"Sparkling Blend, Sparkling","Trento, Northeastern Italy, Italy",Maso Martis,12.5,750,Sparkling,Solstars,12/1/2014,Not rated yet [Add Your Review],Italy,7.0
16,https://www.winemag.com/buying-guide/maso-mart...,90.0,Maso Martis 2009 Dosaggiozero Riserva Sparklin...,"Made with no dosage, this is a blend of 70% Ch...",Kerin O’Keefe,52.0,Dosaggiozero Riserva,"Sparkling Blend, Sparkling","Trento, Northeastern Italy, Italy",Maso Martis,12.5,750,Sparkling,Solstars,12/1/2014,Not rated yet [Add Your Review],Italy,7.0
17,https://www.winemag.com/buying-guide/maximin-g...,90.0,Maximin Grünhäuser 2012 Herrenberg Kabinett Ri...,"Dried oregano and sage notes lend a dusty, sav...",Anna Lee C. Iijima,34.0,Herrenberg Kabinett,Riesling,"Mosel, Germany",Maximin Grünhäuser,8.0,750,White,Loosen Bros. USA,12/1/2014,Not rated yet [Add Your Review],Germany,7.0


In [18]:
df.dtypes

url                 object
wine_points        float64
wine_name           object
wine_desc           object
taster              object
price               object
designation         object
variety             object
appellation         object
winery              object
alcohol             object
bottle size         object
category            object
importer            object
date published      object
user avg rating     object
country             object
year               float64
dtype: object
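
The `dtypes` output above shows `price`, `alcohol` and `bottle size` still stored as `object` (strings), even after the stray characters were stripped. Here is a minimal sketch of one way to finish the conversion. It uses a tiny made-up sample rather than the real file, and the `bottle_ml` column name is an assumption for illustration. `pd.to_numeric` with `errors='coerce'` turns anything it can't parse into `NaN`.

```python
import pandas as pd

# Hypothetical sample mimicking the raw columns; not rows from wine-reviews.csv
sample = pd.DataFrame({
    "price": ["$25, Buy Now", "$65, Buy Now", None],
    "alcohol": ["14.5%", "13%", "N/A"],
    "bottle size": ["750 ml", "1.5 L", "375 ml"],
})

# Pull out the first number in each price string, then convert to a numeric dtype
price = pd.to_numeric(
    sample["price"].str.extract(r"(\d+(?:\.\d+)?)")[0], errors="coerce"
)
# Strip the percent sign; unparseable values such as "N/A" become NaN
alcohol = pd.to_numeric(
    sample["alcohol"].str.replace("%", "", regex=False), errors="coerce"
)
# Only keep bottle sizes given in millilitres; other units become NaN
bottle_ml = pd.to_numeric(
    sample["bottle size"].str.extract(r"^(\d+)\s*ml$")[0], errors="coerce"
)

cleaned = pd.DataFrame({"price": price, "alcohol": alcohol, "bottle_ml": bottle_ml})
print(cleaned.dtypes)
```

Once the columns are numeric, they can go straight into a regression or a `.corr()` call.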

## What might be interesting in this dataset?

Maybe start out playing around _without_ machine learning. Here are some thoughts to get you started:

* I've heard that since the 1990s wine has gone through [Parkerization](https://www.estatewinebrokers.com/blog/the-parkerization-of-wine-in-the-1990s-and-beyond/), an increase in production of high-alcohol, fruity red wines thanks to the influence of wine critic Robert Parker.
* Red and white wines taste different, obviously, but people always use [goofy words to describe them](https://winefolly.com/tutorial/40-wine-descriptions/)
* Once upon a time in 1976 [California wines proved themselves against France](https://en.wikipedia.org/wiki/Judgment_of_Paris_(wine)) and France got very angry about it

In [19]:
# whether a wine is white or red based on the wine's description
# whether alcohol content depends on country of origin
# finding out correlation between wine_points and price
# finding out correlation between wine_points and wine_desc
# finding out correlation between wine_points and variety
# finding out correlation between wine_desc and category
# finding out correlation between variety and country
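
A couple of these questions can be checked with plain pandas before any model is involved. Below is a minimal sketch on a made-up five-row frame. The values are invented for illustration, the column names follow the cleaned dataset, and `price` and `alcohol` are assumed to be numeric already.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Spain"],
    "alcohol": [14.5, 13.5, 13.0, 12.5, 14.0],
    "price": [65.0, 25.0, 64.0, 30.0, 17.0],
    "wine_points": [92.0, 90.0, 93.0, 89.0, 88.0],
})

# Does alcohol content vary by country of origin?
alcohol_by_country = (
    toy.groupby("country")["alcohol"].mean().sort_values(ascending=False)
)
print(alcohol_by_country)

# How strongly do score and price move together? (Pearson correlation)
points_price_corr = toy["wine_points"].corr(toy["price"])
print(round(points_price_corr, 3))
```

Correlation only works between two numeric columns. For text or categories (description vs. category, variety vs. country), a `groupby` or `pd.crosstab` is the closer non-ML equivalent.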

## But machine learning?

Well, you can usually break machine learning down into a few different things. These aren't necessarily perfect ways of categorizing things, but eh, close enough.

* **Predicting a number**
    - Linear regression
    - For example, how does a change in unemployment translate into a change in life expectancy?
* **Predicting a category** (aka classification)
    - Lots of algorithm options: logistic regression, random forest, etc
    - For example, predicting cuisines based on ingredients
* **Seeing what influences a numeric outcome**
    - Linear regression since the output is a number
    - For example, the effect of minority and poverty status on test scores
* **Seeing what influences a categorical outcome**
    - Logistic regression since the output is a category
    - For example, how race and car speed affect whether you get a warning vs a ticket
    - Or how wet/dry pavement and car weight affect whether you survive a car crash
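
To make the last category concrete in wine terms, here is a hedged sketch of reading a logistic regression as "what influences a categorical outcome". The data is simulated: the rule linking alcohol to redness is invented, not a finding from the dataset. scikit-learn's `LogisticRegression` stands in for whichever library you prefer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Simulated alcohol percentages, and an invented rule: more alcohol, more likely red
alcohol = rng.uniform(11, 16, size=n)
p_red = 1 / (1 + np.exp(-1.5 * (alcohol - 13.5)))
is_red = (rng.uniform(size=n) < p_red).astype(int)

# Center the feature at 13.5% so the intercept stays small and the fit is well behaved
X = (alcohol - 13.5).reshape(-1, 1)
model = LogisticRegression()
model.fit(X, is_red)

# exp(coefficient) is the odds ratio: how the odds of "red" change
# for each extra percentage point of alcohol
odds_ratio = np.exp(model.coef_[0][0])
print(round(odds_ratio, 2))
```

An odds ratio above 1 means the feature pushes towards the category, and below 1 means it pushes away. That is the "influence" reading, as opposed to only asking how often the prediction is right.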

We have numbers, we have categories, we have all sorts of stuff. **What are some ways we can mash them together and use machine learning?**

### Brainstorm some ideas

Use the categories above to try to come up with some ideas. Be sure to scroll up where I break down categories vs numbers vs text!

**I'll give you one idea for free:** if you don't have any ideas, start off by creating a classifier that determines whether a wine is white or red based on the wine's description.

In [20]:
# Linear regression - how does the price affect wine ratings?
# Predicting a category - predicting wine category based on wine description
# Seeing what influences a numeric outcome - country and variety on ratings

You can also go to https://library.columbia.edu and see if you can find some academic papers about wine. I'm sure they'll inspire you! (and they might even have some ML ideas in them you can steal, too)

# Implement 2 of your machine learning ideas

## IDEA 1: Linear regression - how does the year affect wine ratings?

In [21]:
df_lr = df[['year', 'wine_points']]

In [22]:
df_lr.head()

Unnamed: 0,year,wine_points
0,7.0,90.0
6,7.0,90.0
7,7.0,90.0
10,7.0,90.0
11,7.0,90.0


In [23]:
df_lr.isna().sum()

year           0
wine_points    0
dtype: int64

In [24]:
df_lr.dtypes

year           float64
wine_points    float64
dtype: object

In [25]:
# year is already float64; copy first so the assignment doesn't raise SettingWithCopyWarning
df_lr = df_lr.copy()
df_lr['year'] = df_lr['year'].astype(float)


In [26]:
import statsmodels.api as sm

# What effect does the year have on ratings?
X = df_lr[['year']]
y = df_lr.wine_points

model = sm.OLS(y, sm.add_constant(X))
results = model.fit()

In [27]:
results.params

const    90.725928
year     -0.210720
dtype: float64

In [28]:
results.summary()

0,1,2,3
Dep. Variable:,wine_points,R-squared:,0.068
Model:,OLS,Adj. R-squared:,0.068
Method:,Least Squares,F-statistic:,975.3
Date:,"Thu, 01 Apr 2021",Prob (F-statistic):,1.08e-206
Time:,22:11:13,Log-Likelihood:,-34334.0
No. Observations:,13318,AIC:,68670.0
Df Residuals:,13316,BIC:,68690.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,90.7259,0.064,1412.621,0.000,90.600,90.852
year,-0.2107,0.007,-31.230,0.000,-0.224,-0.197

0,1,2,3
Omnibus:,220.82,Durbin-Watson:,0.381
Prob(Omnibus):,0.0,Jarque-Bera (JB):,204.689
Skew:,-0.262,Prob(JB):,3.57e-45
Kurtosis:,2.693,Cond. No.,22.3


In [29]:
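# ANALYSIS
# "year" here is the number of years since the review was published (as of 2021)
# For every additional "year", a wine gets about 0.21 fewer "points" (coef = -0.2107)
# I'm guessing that's b/c older wine tends to be more expensive, but the model doesn't test that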
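
To see what the fitted line implies, the coefficients from `results.params` can be plugged into the regression equation by hand. This is a small sketch using the values printed above. The helper name `predicted_points` is made up for illustration, and `year` means years between the review's publication and 2021, not the wine's vintage.

```python
# Coefficients copied from results.params above
const = 90.725928
year_coef = -0.210720

def predicted_points(years_since_review):
    """Predicted score from the one-variable OLS fit (hypothetical helper)."""
    return const + year_coef * years_since_review

for years in (1, 7, 15):
    print(years, round(predicted_points(years), 2))
```

With an R-squared of 0.068, the year explains only about 7% of the variation in scores, so these predictions all sit close to the overall average.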


In [30]:
import statsmodels.formula.api as smf
import numpy as np

model = smf.ols(formula='wine_points ~ np.multiply(year, 100)', data=df_lr)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,wine_points,R-squared:,0.068
Model:,OLS,Adj. R-squared:,0.068
Method:,Least Squares,F-statistic:,975.3
Date:,"Thu, 01 Apr 2021",Prob (F-statistic):,1.08e-206
Time:,22:11:13,Log-Likelihood:,-34334.0
No. Observations:,13318,AIC:,68670.0
Df Residuals:,13316,BIC:,68690.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,90.7259,0.064,1412.621,0.000,90.600,90.852
"np.multiply(year, 100)",-0.0021,6.75e-05,-31.230,0.000,-0.002,-0.002

0,1,2,3
Omnibus:,220.82,Durbin-Watson:,0.381
Prob(Omnibus):,0.0,Jarque-Bera (JB):,204.689
Skew:,-0.262,Prob(JB):,3.57e-45
Kurtosis:,2.693,Cond. No.,2210.0
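
The second fit multiplies `year` by 100. That leaves the fit itself unchanged (same R-squared, same t-statistic) and simply divides the slope by 100, which is why -0.2107 becomes -0.0021. Here is a pure-Python sketch on invented numbers illustrating that property of least squares; `ols_slope` is a hypothetical helper, not part of statsmodels.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Invented data, for illustration only
years = [1, 3, 5, 7, 9]
points = [91.0, 90.0, 90.5, 89.0, 88.5]

slope = ols_slope(years, points)
slope_scaled = ols_slope([y * 100 for y in years], points)

print(slope, slope_scaled)
```

Rescaling a predictor is only a change of units. Whether you report "per year" or "per hundredth of a year" is a readability choice.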


## IDEA 2: Predicting a category - predicting wine category based on wine description 

In [31]:
df_P = df[['wine_desc', 'category']]

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

# Make a vectorizer
vectorizer = CountVectorizer()

# Learn and count the words in df_P.wine_desc
matrix = vectorizer.fit_transform(df_P.wine_desc)

# Convert the matrix of counts to a dataframe
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words_df = pd.DataFrame(matrix.toarray(),
                        columns=vectorizer.get_feature_names_out())

In [48]:
words_df.head(5)

Unnamed: 0,000,01,02,02s,04,05,06,06s,07,08,09,10,100,1000,11,110,114,117,12,125,1290,12th,13,13th,14,14th,15,150,1500,150th,155,15th,16,1610,1667,166th,1674,16th,17,1756,1763,1789,1791,17th,18,180,1800,1806,1850s,1855,1894,18th,19,190,1900,1901,1904,1910,1914,1920s,1922,1930,1930s,1932,1934,1935,1936,1939,1940s,1946,1947,1950,1951,1954,1955,1971,1974,1980s,1983,1984,1985,1986,1990s,1991,1994,1995,1996,1997,1998,1999,19th,20,200,2000,2000s,2001,2001s,2002,2002s,2003,...,wrinkles,write,writers,written,wrong,wrought,würzgarten,xarel,xarello,xavier,ximenez,ximénez,xinomavro,xiv,yalumba,yangarra,yarra,yarrow,yauquén,yealands,year,yearns,years,yeast,yeastiness,yeasts,yeasty,yecla,yellow,yellowing,yellowish,yering,yes,yesterday,yet,yianni,yield,yielded,yielding,yields,yogev,yogurt,yonne,york,you,young,younger,youngest,youngster,your,yours,yourself,youth,youthful,youthfully,ysios,yummy,yung,yup,yuzu,yves,zamora,zantho,zap,zaps,zealand,zellenberg,zeller,zelma,zero,zest,zestier,zestiness,zesty,zibibbo,zierfandler,zimmermann,zin,zinfandel,zing,zinger,zings,zingy,zip,zippiness,zippy,zips,zone,zuccardi,zull,zuri,zweigelt,zwerithaler,zwiegelt,zédé,zéro,élevage,émilion,épernay,ürziger
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [34]:
# .assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_P = df_P.assign(is_red=(df_P['category'] == 'Red').astype(int))
df_P.head()


Unnamed: 0,wine_desc,category,is_red
0,"Inky, minerally aromas of blackberry, black pl...",Red,1
6,The two-acre Clos du Chapitre vineyard is in t...,Red,1
7,"Spice, licorice and herbal notes complement re...",Red,1
10,"Honey-kissed peaches waft from this ripe, rich...",White,0
11,"Full and ripe, this offers balance between ric...",Sparkling,0


In [35]:
X = words_df
y = df_P.is_red

In [36]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [37]:
from sklearn.tree import DecisionTreeClassifier 
clf = DecisionTreeClassifier(max_depth=5)
clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [38]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not red', 'red'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not red,Predicted red
Is not red,1579,68
Is red,295,1388


In [39]:
from sklearn.svm import LinearSVC 
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

LinearSVC(max_iter=10000)

In [40]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['not red', 'red'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted not red,Predicted red
Is not red,1581,66
Is red,55,1628
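
The two confusion matrices are easier to compare once they are boiled down to a few rates. Here is a small sketch that does the arithmetic by hand on the counts printed above; `summarize` is a hypothetical helper name.

```python
def summarize(tn, fp, fn, tp):
    """Accuracy, precision and recall for the 'red' class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # of wines predicted red, how many are red
        "recall": tp / (tp + fn),      # of wines that are red, how many were found
    }

# Counts copied from the two confusion matrices above
tree = summarize(tn=1579, fp=68, fn=295, tp=1388)
svc = summarize(tn=1581, fp=66, fn=55, tp=1628)

for name, scores in (("decision tree", tree), ("linear SVC", svc)):
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

On this particular split the decision tree misses 295 of the 1,683 red wines (recall of about 0.825), while the linear SVC misses 55 (about 0.967). Since `train_test_split` was called without a `random_state`, a rerun would give slightly different counts.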


In [41]:
import eli5

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
eli5.show_weights(clf, feature_names=vectorizer.get_feature_names_out())

Weight?,Feature
+1.102,loganberries
+0.989,raisiny
+0.975,2023
+0.941,imparting
+0.863,tannins
… 2950 more positive …,… 2950 more positive …
… 2575 more negative …,… 2575 more negative …
-0.859,crispness
-0.926,relatively
-0.938,chardonnay


In [42]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# .assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_P = df_P.assign(category_label=le.fit_transform(df_P.category))
df_P.head()


Unnamed: 0,wine_desc,category,is_red,category_label
0,"Inky, minerally aromas of blackberry, black pl...",Red,1,3
6,The two-acre Clos du Chapitre vineyard is in t...,Red,1,3
7,"Spice, licorice and herbal notes complement re...",Red,1,3
10,"Honey-kissed peaches waft from this ripe, rich...",White,0,6
11,"Full and ripe, this offers balance between ric...",Sparkling,0,5


In [43]:
X = words_df
y = df_P.category_label

In [44]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [45]:
from sklearn.svm import LinearSVC 
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

LinearSVC(max_iter=10000)

In [46]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(le.classes_)
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

Unnamed: 0,Predicted Dessert,Predicted Fortified,Predicted Port/Sherry,Predicted Red,Predicted Rose,Predicted Sparkling,Predicted White
Is Dessert,37,1,0,9,1,6,27
Is Fortified,0,0,2,1,0,1,0
Is Port/Sherry,0,0,35,13,0,2,4
Is Red,0,0,8,1619,8,7,26
Is Rose,0,0,2,27,61,12,12
Is Sparkling,1,0,0,16,21,232,54
Is White,11,1,2,25,5,37,1004


In [47]:
from sklearn.metrics import confusion_matrix

y_true = y_test
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(le.classes_)
scores = pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)
scores.style.background_gradient(cmap='YlGnBu')

Unnamed: 0,Predicted Dessert,Predicted Fortified,Predicted Port/Sherry,Predicted Red,Predicted Rose,Predicted Sparkling,Predicted White
Is Dessert,0.45679,0.012346,0.0,0.111111,0.012346,0.074074,0.333333
Is Fortified,0.0,0.0,0.5,0.25,0.0,0.25,0.0
Is Port/Sherry,0.0,0.0,0.648148,0.240741,0.0,0.037037,0.074074
Is Red,0.0,0.0,0.004796,0.970624,0.004796,0.004197,0.015588
Is Rose,0.0,0.0,0.017544,0.236842,0.535088,0.105263,0.105263
Is Sparkling,0.003086,0.0,0.0,0.049383,0.064815,0.716049,0.166667
Is White,0.010138,0.000922,0.001843,0.023041,0.004608,0.034101,0.925346
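
Dividing each row by its total turns the diagonal of the table into per-category recall. Here is a short sketch of that arithmetic in plain Python, using the counts from the multi-class confusion matrix above; the variable names are made up for illustration.

```python
# Rows are true categories, columns are predictions (counts copied from the table above)
labels = ["Dessert", "Fortified", "Port/Sherry", "Red", "Rose", "Sparkling", "White"]
counts_matrix = [
    [37, 1, 0, 9, 1, 6, 27],
    [0, 0, 2, 1, 0, 1, 0],
    [0, 0, 35, 13, 0, 2, 4],
    [0, 0, 8, 1619, 8, 7, 26],
    [0, 0, 2, 27, 61, 12, 12],
    [1, 0, 0, 16, 21, 232, 54],
    [11, 1, 2, 25, 5, 37, 1004],
]

# Support: how many test wines actually belong to each category
support = {label: sum(row) for label, row in zip(labels, counts_matrix)}

# Recall: correct predictions (the diagonal) divided by the row total
recall = {
    label: row[i] / sum(row)
    for i, (label, row) in enumerate(zip(labels, counts_matrix))
}

for label in labels:
    print(f"{label:12s} n={support[label]:5d} recall={recall[label]:.3f}")
```

The support column matters when reading the percentages. Fortified has only 4 wines in the test set, so its row says very little, while Red and White have over a thousand each and are classified correctly more than 92% of the time.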
