# Regression: Using wine descriptions to predict wine prices

Let's add a new tool to our toolbox: **regression**. Basically speaking, it predicts numbers.

**Regression** predicts **continuous** variables, like \$7 or \$7.50 or \$9.35. **Normal number stuff.**

Previously we've predicted using **classification**, which predicts **categorical** variables, like red vs. white.

> You might get confused that when we did classification our categories were numbers - red might be `2` and white might be `1` and dessert wine might be `0`. But!! The reason regression is different is that you'll **never** have the computer say "oh it's like 1.5, a little more than white and a little less than red." With classification, a prediction is always just one specific category.

So in short, **if you're trying to predict a number, you want regression.** Let's see how it works!

## Preparing our data

### Step 1.1: Read in our data

First, we're going to read in our data. It's wine reviews!

While we're reading it in, we're also going to convert the alcohol and price levels to integers. They come in as strings because I didn't clean the data very well earlier!

**Be sure to move this notebook into the same directory as your wine reviews CSV!**

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib as plt

df = pd.read_csv("wine-reviews.csv")

# Clean up alcohol and price and convert to integers
df['alcohol_int'] = df.alcohol.str.extract("(\d+)", expand=False).dropna().astype(int)
df['price_int'] = df.price.str.extract("(\d+)", expand=False).dropna().astype(int)

# Get rid of anything without a description or a price
df.dropna(subset=['price_int', 'wine_desc'], how='any', inplace=True)

df.head(3)

Unnamed: 0,alcohol,appellation,bottle size,category,date published,designation,importer,price,taster,url,user avg rating,variety,wine_desc,wine_name,wine_points,winery,alcohol_int,price_int
0,14%,"Barolo, Piedmont, Italy",750 ml,Red,9/1/2017,Cannubi,Oliver McCrum Wines,"$60, Buy Now",Kerin O’Keefe,http://www.winemag.com/buying-guide/brezza-201...,Not rated yet [Add Your Review],Nebbiolo,One of the best expressions from the classic C...,Brezza 2013 Cannubi (Barolo),98.0,Brezza,14.0,60.0
2,14.5%,"Barolo, Piedmont, Italy",750 ml,Red,9/1/2017,Vigna Rionda Riserva,Vineyard Brands,"$151, Buy Now",Kerin O’Keefe,http://www.winemag.com/buying-guide/massolino-...,Not rated yet [Add Your Review],Nebbiolo,From one of the most celebrated vineyards in t...,Massolino 2011 Vigna Rionda Riserva (Barolo),98.0,Massolino,14.0,151.0
3,14%,"Barolo, Piedmont, Italy",750 ml,Red,9/1/2017,Monvigliero,Bacchanal Wine Imports,"$70, Buy Now",Kerin O’Keefe,http://www.winemag.com/buying-guide/comm-g-b-b...,Not rated yet [Add Your Review],Nebbiolo,Always the firm's showstopper and one of the b...,Comm. G. B. Burlotto 2013 Monvigliero (Barolo),98.0,Comm. G. B. Burlotto,14.0,70.0


## Creating features

Same as with classification, our features need to be **numbers**. The regression model won't recognize words!

### Step 2.0: If you aren't analyzing text, you'll just move some columns into `features_df` like

```
features_df = df[['age','weight']]
```

but in this case, we *are* analyzing text, so...

### Step 2.1: If analyzing text, create your vectorizer

Which kind of vectorizer are you using? A **CountVectorizer** to only count values, or a **TfIdfVectorizer** to count percentages? Once you've figured it out, answer the following questions.

1. **vocabulary**: are you looking for a specific set of words? It's just a normal list. If you don't give a vocabulary, the computer will figure out the list of words for you (it's pretty good at that!) - so you don't usually use a vocabulary unless you have a REALLY GOOD REASON.
1. **ngram_range**: are you only vectorizing single words, or are you also looking at multi-word phrases? By default it only looks for one word `(1,1)`, but you can look for 1-2 word phrases `(1,2)`, only 4-word phrases `(4,4)`, etc.
1. **binary**: Do you want to just test to see if a word is included or not, and don't care about counting? `True` or `False`.
1. **tokenizer**: are you going to do any stemming or lemmatization, or are you okay with the existing words?
1. **stop_words**: do you use stopwords? Stopwords are useless for judging content, but good for judging style. `english` will give default words, or use a list to use multiple.
1. **max_features**: if your classifier or regression is too slow, you might want to limit the number of features that the model will use. By default it uses unlimited features, but you can say hey no, use `500` or something like that.
1. **max_df**: do you want to not include words that show up in a lot of documents? `0.0`-`1.0` to have a percentage as a ceiling, or an integer to have a maximum number of documents. For example, "5" means "Ignore anything that shows up in more than 5 documents" 
1. **min_df**: do you want to not include words that show up in a only a few documents? `0.0`-`1.0` to have a percentage as a floor, or an integer to have a maximum number of documents. For example, "0.05" means "Ignore anything that shows up in fewer than 5% of documents" 
1. **use_idf**: do you want to use inverse document frequency, which makes less frequent words more important? (`TfidfVectorizer` only)

### Step 2.1: Okay, let's actually create our vectorizer

We're going to use a `CountVectorizer`.

* We're also going to use stop words, since **"and"** and **"the"** probably aren't important to wine prices.
* We're going to use `max_features=4000` to stick to 4000 features so our regression won't be as terribly slow as with all 6000. It'll still be slow, though!

I'm also going to **remove numbers from the description** by feeding our vectorizer `df['wine_desc'].str.replace("\d","")`. I just don't think numbers matter!

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
vec = CountVectorizer(stop_words = 'english',
                      max_features = 1500,
                      ngram_range=(1,2),
                      binary=True, 
                      token_pattern=r'\b[a-zA-Z][a-zA-Z]+\b')
# Let's also remove numbers before we count words
matrix = vec.fit_transform(df['wine_desc'].str.replace("\d",""))
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())
features_df.head()

Unnamed: 0,abrasive,abundant,accent,accented,accents,accessible,acid,acidic,acidity,acidity drink,...,yellow pear,young,young wine,youthfully,zest,zestiness,zesty,zesty acidity,zinfandel,zippy
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0


# Preparing to do the work

### Step 3.1: Train/test split

Some data you'll train your regression model with, some you'll use for testing. That's how you know how good your model is!

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values, # these are our words
    df.price_int, # these are our prices
    test_size = 0.2)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3131, 1500) (783, 1500) (3131,) (783,)


## Doing the regression

### Step 4.1 Creating and training your model

You're just going to use a **linear regression**... because I said so.

* **X_train** is our features to learn about (our vectorized descriptions, aka our word counts)
* **y_train** is the prices for the wines we're learning about

You can look at them individually if you want! This is just telling our model to learn what features might be connected to higher or lower prices.

In [5]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression(fit_intercept = 0)
lr.fit(X_train, y_train)



LinearRegression(copy_X=True, fit_intercept=0, n_jobs=1, normalize=False)

That error message is okay, don't worry about it! **And if it's taking forever?** You have too many features! Maybe go back up to your `CountVectorizer` and add `max_features=500` so it isn't looking at like 6000 variables at a time. But maybe try it with a lot for the first time, just to see. Go make a coffee or something while you wait.

### Step 4.2 Testing your model

To test your model, you say "hey, how far are my REAL data points from what the regression would have predicted?" A little math later and this is the [R squared](https://en.wikipedia.org/wiki/Coefficient_of_determination). It's the percent explained by your model. `0.30` would mean "your predictor explains 30% of the price."

Speaking practically, the larger the better. 100% would mean you did a perfect job!

We show you the R squared for test and train below, **but remember: it's only important for the test data!!!** Being proud of a score with your training data is just silly, since your model is basically cheating.

In [6]:
from sklearn.metrics import mean_squared_error, r2_score

print('R2 on train:', r2_score(y_train, lr.predict(X_train)), '\nMean Squared Error on train:', mean_squared_error(y_train, lr.predict(X_train)))

print('R2 on test:', r2_score(y_test, lr.predict(X_test)), '\nMean Squared Error on train:', mean_squared_error(y_test, lr.predict(X_test)))

R2 on train: 0.725644876553 
Mean Squared Error on train: 215.848315152
R2 on test: 0.0178056214745 
Mean Squared Error on train: 949.191539435


How did we do? **Not so well, but there's room for improvement.**

> **Jager's note:** This classifier is great for interpolation, ie predicting prices based on reviews which lie within the set of reviews the model was trained on. **This is due to the varied vocabulary of the sommelier.**  The model fails to extrapolate to the test set, also due to the varied nature of the vocabulary.  A negative R2 score reported by sklearn means the model performs arbitrarily worse than just guessing the mean price.

Why didn't we do so well? Basically, our model paying a lot of attention to **very specific words** from the training set, and then gets confused when it doesn't see those words in the test data. We should probably have it trained with  **more general words!**

### Step 4.3: Having a little fun for a second

Let's use our regression to test a couple reviews we wrote ourselves! We use **vec.transform** here instead of `fit_transform` because "fit" means "learn words" but here we just want to say "hey we know the words already, just count the ones we've seen before."

In [7]:
# Run this, see the prices. Then change 'hills' to 'hillsides'
samples = [
    "This Alsacian wine smells like blueberries and cherries",
    "A hand-crafted red from the sun swept hills of China"
]
matrix = vec.transform(samples)
lr.predict(matrix)

array([ 10.55603911,  30.36262418])

### 4.4 Understanding the components of our model

We're going to use this code that Jager so kindly wrote for us! You can change the `n = 10` line if you want to see more or fewer variables.

In [8]:
def print_top_words(model, feature_names, n_top_words):
    print('\n--------------------------------\n')
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % (topic_idx+1))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
n = 10        
bottom_n = np.argsort(lr.coef_)[n::-1]
top_n = np.argsort(lr.coef_)[-n:]

feature_names = vec.get_feature_names()

print('Largest Contributors to High Price\n')
print("\n".join(feature_names[i] + " " + str(lr.coef_[i]) for i in top_n))

print('\n\nLargest Contributors to Low Price\n')
print("\n".join(feature_names[i] + " " + str(lr.coef_[i]) for i in bottom_n))

Largest Contributors to High Price

expansive 28.238131849
chocolaty 28.4994276672
peppercorn 29.3282793048
ll 31.5578864525
cherry cassis 33.5772501008
charred 34.3414600864
seamless 38.7384918402
petit verdot 42.1924777363
coffee bean 48.5842990317
lip smacking 58.9694194767


Largest Contributors to Low Price

decent -20.4865857557
new leather -22.2599221879
tilled -22.7344501559
herb cherry -22.9483144095
bone dry -23.4254817587
tannins leave -29.61429592
smacking -30.6271449712
blackberry jam -30.9546531822
doles -31.1625751361
roasted coffee -40.4129556457
petit -43.7625182747


Those words probably seem... kind of stupid. 

## !!! Your assignment is to improve our measurements and get some sensible words in these lists!!!

# All of the code in one small-ish area, plus some secret stuff

If you don't need the walkthrough and the explanations, here's everyting all at once! You can import the data up at the top, and then use this area to play around with vectorizing variables.

### The secrets to getting a better score

If your r-square is **higher** then you're doing **better**. How high can you get it? Here are some tips:

* Right now your model is paying too much attention to **rare words**, words that show up **infrequently**. Is there an option you can give your vectorizer to not pay attention to rare words?
* I know we talk bad about multiple-word phrases, but "red fruit" versus "citrus fruit" might be important here.
* Try to add **categorical variables** as features. You can't just say "Red is 0, White is 1, etc," but I've included some code to show you how to do it!
* I don't know, just keep running with those ideas!

### Imports

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

### Vectorize

Pay attention to how many columns you have down below. More words = more columns, but that doesn't mean a better result! How can you get rid of words that don't show up that often?

In [10]:
vec = CountVectorizer(stop_words = 'english', max_features=3000)
# Let's also remove numbers before we count words
matrix = vec.fit_transform(df['wine_desc'].str.replace("\d",""))
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names())

features_df.head(2)

Unnamed: 0,abandon,abound,abounds,abrasive,abrupt,absolute,absolutely,abundance,abundant,acacia,...,yummy,zest,zestiness,zesty,zinfandel,zing,zingy,zip,zippy,émilion
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Special secret: Adding categorical variables

Want do use this? Uncomment it! Scroll down a lot to read why and how this works. You don't need to do this if you don't want to, it might get overly complicated.

In [19]:
# You can only run this once!!! Then you'll get an error because of duplicate columns.
# If you get an error, re-run your vectorizer up above.

custom = pd.get_dummies(df['category'], prefix="CUSTOM", drop_first=True)
features_df = features_df.join(custom).fillna(0)
features_df.head(2)

Unnamed: 0,abandon,abound,abounds,abrasive,abrupt,absolute,absolutely,abundance,abundant,acacia,...,zing,zingy,zip,zippy,émilion,CUSTOM_Port/Sherry,CUSTOM_Red,CUSTOM_Rose,CUSTOM_Sparkling,CUSTOM_White
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,1.0,0.0,0.0,0.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0.0,0.0,0.0,0.0,0.0


If you want to remove custom columns that aren't used that often, use the code below. This is kind of like `min_idf`! So for example, the `features_df[custom_columns].mean() < 0.005` part below says **remove any custom columns that only show up in less than 0.5% of rows.**

In [77]:
custom_columns = features_df.columns[features_df.columns.str.contains("CUSTOM")]
low_frequency = custom_columns[features_df[custom_columns].mean() < 0.005]
print("Removing column", list(low_frequency))
features_df.drop(low_frequency, axis=1, inplace=True)

Removing column []


### Test/train split + regression

In [76]:
X_train, X_test, y_train, y_test = train_test_split(
    features_df.values, # these are our words
    df.price_int, # these are our prices
    test_size = 0.2)

lr = LinearRegression(fit_intercept = 0)
lr.fit(X_train, y_train)

print('R2 on train:', r2_score(y_train, lr.predict(X_train)), '\nMean Squared Error on train:', mean_squared_error(y_train, lr.predict(X_train)))

print('R2 on test:', r2_score(y_test, lr.predict(X_test)), '\nMean Squared Error on train:', mean_squared_error(y_test, lr.predict(X_test)))

R2 on train: 0.999951954527 
Mean Squared Error on train: 0.0389273689806
R2 on test: -2.29327752598e+21 
Mean Squared Error on train: 2.0056767367e+24


### See the important words

In [None]:
def print_top_words(model, feature_names, n_top_words):
    print('\n--------------------------------\n')
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % (topic_idx+1))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
n = 10        
bottom_n = np.argsort(lr.coef_)[n::-1]
top_n = np.argsort(lr.coef_)[-n:]

feature_names = list(features_df.columns)

print('Largest Contributors to High Price:\n')
print("\n".join(feature_names[i] + " " + str(lr.coef_[i]) for i in top_n))

print('\n\nLargest Contributors to Low Price:\n')
print("\n".join(feature_names[i] + " " + str(lr.coef_[i]) for i in bottom_n))

# SPECIAL SECRET: How to add features based on a category column (e.g. winery, type of wine, etc)

When using categorical variables in a regression, you want to take something like...

|Wine|Category|
|---|---|
|0|Red|
|1|Red|
|2|White|
|3|Dessert|

and turn it into features that look like this:

|CUSTOM_Red|CUSTOM_White|CUSTOM_Dessert|
|---|---|---|
|1|0|0|
|1|0|0|
|0|1|0|
|0|0|1|

(Except only kind of, because of [the dummy variable trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html)) But hey, who cares, whatever, now you know about the hilariously-named `pd.get_dummies`.

In [None]:
df['category'].head()

In [None]:
pd.get_dummies(df['category'], prefix="CUSTOM", drop_first=True).head()

You use `drop_first=True` to make `Dessert Wine` not be a category, because of that "dummy trap" thing I mentioned up above. And we prefix it with "CUSTOM" to keep it different from actual words used.

You can see this in use up above near the "Vectorize" section - I just commented it out so it wouldn't get in the way.

**I"d be happy to talk more about this with you!**