## Summary

This project classifies wines by variety from the text of their tasting reviews, using text analysis and modeling. The data set is publicly available on [Kaggle](https://www.kaggle.com/zynicide/wine-reviews) and contains around 130K records. Its columns are summarized below.

* **country**: The country that the wine is from
* **description**: The taster's written review of the wine
* **designation**: The vineyard within the winery where the grapes that made the wine are from
* **points**: The number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >= 80)
* **price**: The cost for a bottle of the wine
* **province**: The province or state that the wine is from
* **region_1**: The wine growing area in a province or state (e.g. Napa)
* **region_2**: A more specific region within a wine growing area (e.g. Rutherford inside the Napa Valley); this value can be blank
* **taster_name**: The name of the taster
* **taster_twitter_handle**: The Twitter handle of the taster
* **title**: The title of the wine review, which often contains the vintage if you're interested in extracting that feature
* **variety**: The type of grapes used to make the wine (e.g. Pinot Noir)
* **winery**: The winery that made the wine


We use the **description** column as the input and predict the wine variety, taking the labels from the **variety** column.



In [4]:
import warnings
warnings.filterwarnings('ignore')

In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn import svm 
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from collections import defaultdict
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
from keras.models import Sequential
from keras.layers import Dense, Conv1D, Flatten
from keras.layers.embeddings import Embedding
import nltk
nltk.download('stopwords')


  from numpy.core.umath_tests import inner1d
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


[nltk_data] Downloading package stopwords to /home/jia/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Preliminary analysis

We first load the dataset from the corresponding directory and obtain an overview of it. 

In [2]:
data = pd.read_csv('/home/yiwei/yiwei_data/winemag-data-130k-v2.csv', index_col = 0)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129971 entries, 0 to 129970
Data columns (total 13 columns):
country                  129908 non-null object
description              129971 non-null object
designation              92506 non-null object
points                   129971 non-null int64
price                    120975 non-null float64
province                 129908 non-null object
region_1                 108724 non-null object
region_2                 50511 non-null object
taster_name              103727 non-null object
taster_twitter_handle    98758 non-null object
title                    129971 non-null object
variety                  129970 non-null object
winery                   129971 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 13.9+ MB


We remove the one record with *NULL* variety label and keep only the **description** and **variety** for the analysis.

In [4]:
col = ['description', 'variety']

In [5]:
df = data[col]

In [6]:
df = df[pd.notnull(df['variety'])]

There are $707$ unique wine varieties in total. We only take the top 25 varieties as categories for prediction and remove those records whose variety is not among the top 25. 

In [7]:
len(df['variety'].unique())

707

In [8]:
top_25 = df.groupby('variety').count().sort_values('description', ascending = False)[0:25]

In [9]:
top_25

| variety | description |
|---|---|
| Pinot Noir | 13272 |
| Chardonnay | 11753 |
| Cabernet Sauvignon | 9472 |
| Red Blend | 8946 |
| Bordeaux-style Red Blend | 6915 |
| Riesling | 5189 |
| Sauvignon Blanc | 4967 |
| Syrah | 4142 |
| Rosé | 3564 |
| Merlot | 3102 |


In [10]:
label = {variety: num for num, variety in enumerate(top_25.index.tolist())}

In [11]:
#{value: key for key, value in label.items()}

In [12]:
label

{'Bordeaux-style Red Blend': 4,
 'Bordeaux-style White Blend': 24,
 'Cabernet Franc': 21,
 'Cabernet Sauvignon': 2,
 'Champagne Blend': 20,
 'Chardonnay': 1,
 'Grüner Veltliner': 22,
 'Malbec': 13,
 'Merlot': 9,
 'Nebbiolo': 10,
 'Pinot Gris': 19,
 'Pinot Noir': 0,
 'Portuguese Red': 14,
 'Portuguese White': 23,
 'Red Blend': 3,
 'Rhône-style Red Blend': 18,
 'Riesling': 5,
 'Rosé': 8,
 'Sangiovese': 12,
 'Sauvignon Blanc': 6,
 'Sparkling Blend': 16,
 'Syrah': 7,
 'Tempranillo': 17,
 'White Blend': 15,
 'Zinfandel': 11}

In [13]:
df_top = df[df['variety'].isin(label.keys())]
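
The selection of the most frequent varieties and the integer label mapping can also be illustrated without pandas. The following is a minimal sketch on a hypothetical toy list of labels (the names `varieties`, `top_k_labels`, `label_toy`, and `kept` are illustrative and not part of the notebook); it assumes the most frequent variety should receive id 0, as in the `label` dictionary above.

```python
from collections import Counter

# Hypothetical toy labels standing in for df['variety'].
varieties = ["Pinot Noir", "Chardonnay", "Pinot Noir", "Syrah",
             "Chardonnay", "Pinot Noir", "Merlot"]

def top_k_labels(values, k):
    """Map the k most frequent values to integer ids 0..k-1 (most frequent first)."""
    ranked = Counter(values).most_common(k)
    return {name: idx for idx, (name, _count) in enumerate(ranked)}

label_toy = top_k_labels(varieties, k=2)

# Keep only the records whose variety is among the top k, as done for df_top.
kept = [v for v in varieties if v in label_toy]

print(label_toy)
print(len(kept))
```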

### Dataset preparation


In [14]:
description_list = df_top['description'].tolist()

In [15]:
variety_list = [label[i] for i in df_top['variety'].tolist()]

In [16]:
train_x, test_x, train_y, test_y = train_test_split(description_list, variety_list, test_size=0.3, random_state = 1216)

In [17]:
len(train_x)

70163

In [18]:
len(test_x)

30070

### Feature engineering

In this section, we process the raw text in **description** and transform it into feature vectors. We explore the following ideas.

* TF-IDF vectors on word level as features: we tokenize the text with the regular expression `[\w\%']+`, strip accents, and use the English stopwords from the NLTK library as the stopword set. Furthermore, we ignore tokens that appear in fewer than $3$ descriptions when building the vocabulary by setting *min_df* = $3$.
* TF-IDF vectors on N-gram level as features: we use the same *token_pattern* as for TF-IDF on word level and set *ngram_range* to $(2, 3)$, i.e. bigrams and trigrams.
* Word embedding vectors as features: we use the pre-trained GloVe word embeddings with IDF weightings to build feature vectors.
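
To make the tokenization step concrete, here is a minimal standard-library sketch of what the analyzer roughly does to a single description: lowercase, strip accents, apply the token pattern, and drop stopwords. The stopword set below is a small hypothetical subset rather than the full NLTK list, and the accent stripping is an approximation of `strip_accents='unicode'`; the function name `analyze` is illustrative.

```python
import re
import unicodedata

# Effective token pattern: word characters, '%' and apostrophes.
TOKEN_PATTERN = re.compile(r"[\w\%']+")

# Hypothetical small subset of the NLTK English stopwords.
STOP_WORDS = {"the", "and", "with", "a", "of", "it's"}

def analyze(text):
    """Lowercase, strip accents, tokenize, then drop stopwords."""
    text = text.lower()
    # Decompose characters and drop combining marks (e.g. 'é' -> 'e').
    text = "".join(ch for ch in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(ch))
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in STOP_WORDS]

tokens = analyze("It's a Rosé with 15% Syrah and hints of cherry.")
print(tokens)
```

Keeping `%` and the apostrophe inside the character class means that tokens such as `15%` and contractions survive as single tokens instead of being split.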

#### TF-IDF on word level


In [19]:
tfidf_vect = TfidfVectorizer(analyzer='word',
                             token_pattern=r"[\w\%']+",  # raw string, same pattern as before
                             strip_accents='unicode',
                             stop_words=set(nltk.corpus.stopwords.words('english')),
                             min_df=3)
tfidf_vect.fit(train_x)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'will', 'than', "you're", 'down', 'o', 'does', 'have', 'me', 'were', "you'll", 'in', 'y', 'and', 'this', 'a', 'hers', 'it', 'doing', 'do', 'didn', 'our', 'the', 'they', 'those', 'mightn', 'itself', 'is', 'their', "couldn't", 'having', 'further', "weren't", 'myself', 'until', 'we', 'yours..., 'all', 'aren', 'of', "you'd", 'yours', 'between', 'i', "you've", "mightn't", 'or', 'his', 'about'},
        strip_accents='unicode', sublinear_tf=False,
        token_pattern="[\\w\\%']+", tokenizer=None, use_idf=True,
        vocabulary=None)

In [21]:
with open('/home/yiwei/yiwei_data/vectorizerDesc.pk', 'wb') as file:
     pickle.dump(tfidf_vect, file)

In [22]:
len(tfidf_vect.vocabulary_)

12618

In [23]:
xtrain_tfidf =  tfidf_vect.transform(train_x)
xtest_tfidf =  tfidf_vect.transform(test_x)
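
The effect of *min_df* and the IDF weighting can be sketched by hand. The example below assumes the smoothed IDF that scikit-learn documents as its default, $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and leaves out term frequencies and the L2 normalization; the toy documents and the helper `smooth_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
docs = [["cherry", "oak", "tannins"],
        ["cherry", "citrus"],
        ["cherry", "oak", "vanilla"]]

def smooth_idf(docs, min_df=1):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1 for terms with df(t) >= min_df."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items() if d >= min_df}

idf = smooth_idf(docs, min_df=2)
print(idf)
```

Terms that occur in every document receive the minimum weight of 1, while rarer terms receive larger weights; terms below the *min_df* threshold never enter the vocabulary.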

#### TF-IDF on N-gram level

In [24]:
tfidf_vect_ngram = TfidfVectorizer(analyzer='word',
                                   token_pattern=r"[\w\%']+",  # raw string, same pattern as before
                                   strip_accents='unicode',
                                   ngram_range=(2, 3),
                                   stop_words=set(nltk.corpus.stopwords.words('english')),
                                   min_df=3)
tfidf_vect_ngram.fit(train_x)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=3,
        ngram_range=(2, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'will', 'than', "you're", 'down', 'o', 'does', 'have', 'me', 'were', "you'll", 'in', 'y', 'and', 'this', 'a', 'hers', 'it', 'doing', 'do', 'didn', 'our', 'the', 'they', 'those', 'mightn', 'itself', 'is', 'their', "couldn't", 'having', 'further', "weren't", 'myself', 'until', 'we', 'yours..., 'all', 'aren', 'of', "you'd", 'yours', 'between', 'i', "you've", "mightn't", 'or', 'his', 'about'},
        strip_accents='unicode', sublinear_tf=False,
        token_pattern="[\\w\\%']+", tokenizer=None, use_idf=True,
        vocabulary=None)

In [25]:
len(tfidf_vect_ngram.vocabulary_)

160671

In [26]:
xtrain_ngram =  tfidf_vect_ngram.transform(train_x)
xtest_ngram =  tfidf_vect_ngram.transform(test_x)
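
The much larger vocabulary at N-gram level follows from how the features are formed: every run of 2 or 3 consecutive tokens becomes its own feature. A minimal sketch, assuming the tokens have already been lowercased and filtered for stopwords (the helper `word_ngrams` is illustrative, not a scikit-learn function):

```python
def word_ngrams(tokens, ngram_range=(2, 3)):
    """Return all space-joined n-grams with n in the inclusive range."""
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams

grams = word_ngrams(["ripe", "black", "cherry", "fruit"])
print(grams)
```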

#### Word embedding


We use the pre-trained GloVe word embeddings to transform each word in the text into a $100$-dimensional vector. First, we load the embeddings as a dictionary with $400000$ key-value pairs. The easiest way to build features with word embeddings is to average the word vectors of all words in the text. Here we first tokenize the text with the tokenizer of tfidf_vect and then use the inverse document frequency (IDF) as the weighting for the word vectors to obtain feature vectors. Note that *build_tokenizer* only applies the token pattern; lowercasing, accent stripping, and stopword removal are performed by *build_analyzer* instead.

**Note**: to handle words that were not seen when fitting tfidf_vect, we set their default weight to the maximum of all IDF values, since an unseen word must be at least as rare as any known word.

In [27]:
with open("/home/yiwei/yiwei_data/glove.6B.100d.txt", "rb") as lines:
    wordVec = {line.split()[0].decode('utf-8'): np.array(list(map(float, line.split()[1:])))
           for line in lines}

In [28]:
# use the tokenizer from tfidf_vect
text_tokenizer = tfidf_vect.build_tokenizer()

In [29]:
max_idf = max(tfidf_vect.idf_)

In [30]:
weight = defaultdict(lambda: max_idf, [(token, tfidf_vect.idf_[i]) for token, i  in tfidf_vect.vocabulary_.items()])

In [31]:
xtrain_tokens  = [text_tokenizer(doc) for doc in train_x]

In [32]:
xtrain_embedding =  np.array([
        np.mean([wordVec[token]*weight[token] for token in text if token in wordVec] 
                 or [np.zeros(100)], axis=0)
        for text in xtrain_tokens
        ])

In [33]:
xtest_tokens  = [text_tokenizer(doc) for doc in test_x]

In [34]:
xtest_embedding =  np.array([
        np.mean([wordVec[token]*weight[token] for token in text if token in wordVec] 
                 or [np.zeros(100)], axis=0)
        for text in xtest_tokens
        ])
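
The weighting scheme can be checked on a toy example. The sketch below uses hypothetical 3-dimensional vectors and IDF values (all names prefixed with `toy_` are illustrative); it mirrors the logic above: tokens without an embedding are skipped, tokens without an IDF value fall back to the maximum IDF, and a text with no usable token maps to the zero vector.

```python
from collections import defaultdict
import numpy as np

# Hypothetical 3-dimensional embeddings and IDF values.
toy_vec = {"cherry": np.array([1.0, 0.0, 0.0]),
           "oak": np.array([0.0, 1.0, 0.0]),
           "plum": np.array([0.0, 0.0, 1.0])}
toy_idf = {"cherry": 1.0, "oak": 2.0}

# Tokens missing from toy_idf get the largest known IDF as their weight.
toy_max_idf = max(toy_idf.values())
toy_weight = defaultdict(lambda: toy_max_idf, toy_idf)

def embed(tokens, dim=3):
    """Mean of the IDF-scaled vectors of all tokens that have an embedding."""
    vectors = [toy_vec[t] * toy_weight[t] for t in tokens if t in toy_vec]
    return np.mean(vectors or [np.zeros(dim)], axis=0)

print(embed(["cherry", "plum", "unknown"]))
print(embed(["unknown"]))
```

Note that the result is the mean of IDF-scaled vectors rather than a normalized weighted average, so texts dominated by rare words yield vectors with larger norms.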

### Modelling

We explore three classical classifiers for this problem:
* multinomial Naive Bayes classifier;
* support vector machine (SVM);
* neural networks
    

To avoid writing repeated code, we put together a function for fitting different models with different feature vectors.

In [35]:
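def train_model(clf, feature_train, label_train, feature_test, label_test, name):
    """Fit clf, save it to disk, and return its accuracy on the test set."""
    clf.fit(feature_train, label_train)

    # save the model to disk; the with-block closes the file afterwards
    filename = '/home/yiwei/yiwei_data/' + name + '.sav'
    with open(filename, 'wb') as file:
        pickle.dump(clf, file)

    y_predict = clf.predict(feature_test)

    # accuracy_score expects (y_true, y_pred)
    return metrics.accuracy_score(label_test, y_predict)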

#### Multinomial Naive Bayes


In [92]:
print('Multinomial Naive Bayes Accuracy with TF-IDF on word level: %0.4f' % train_model(MultinomialNB(), xtrain_tfidf, train_y, xtest_tfidf, test_y, name = 'MultiNB'))

In [82]:
print('Multinomial Naive Bayes Accuracy with TF-IDF on N-gram level: %0.4f' % train_model(MultinomialNB(), xtrain_ngram, train_y, xtest_ngram, test_y, name = 'MultiNB_ngram'))

Multinomial Naive Bayes Accuracy with TF-IDF on N-gram level: 0.4738


Because Multinomial Naive Bayes is intended for non-negative input, we use Gaussian Naive Bayes classifier to predict the wine variety based on the word embedding vectors.

In [83]:
print('Gaussian Naive Bayes Accuracy with word embedding: %0.4f' % train_model(GaussianNB(), xtrain_embedding, train_y, xtest_embedding, test_y, name = 'MultiNB_embedding'))

Gaussian Naive Bayes Accuracy with word embedding: 0.1473
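
The reason the multinomial model needs non-negative input can be seen from its smoothed parameter estimate, $\hat\theta_{j} = \frac{N_{j} + \alpha}{N + \alpha d}$, whose logarithm is used for prediction. The following is a minimal sketch of that estimate for a single class, written from the formula rather than taken from scikit-learn's code; `smoothed_log_prob` is a hypothetical helper. With count-like feature totals the logarithm is well defined, but with embedding-like totals a negative entry can make its argument negative.

```python
import math

def smoothed_log_prob(feature_totals, alpha=1.0):
    """log((N_j + alpha) / (sum_j N_j + alpha * d)) for one class."""
    d = len(feature_totals)
    denom = sum(feature_totals) + alpha * d
    return [math.log((n + alpha) / denom) for n in feature_totals]

# Count-like totals (e.g. summed TF-IDF values): well defined.
print(smoothed_log_prob([3.0, 1.0, 0.0]))

# Embedding-like totals with a negative entry: the logarithm is undefined.
try:
    smoothed_log_prob([0.4, -2.3, 1.1])
except ValueError as err:
    print("undefined log-probability:", err)
```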


#### SVM model

In [87]:
print('SVM Accuracy with TF-IDF on word level: %0.4f' % train_model(svm.SVC(kernel = 'linear'), xtrain_tfidf, train_y, xtest_tfidf, test_y, name = 'svcModel'))

SVM Accuracy with TF-IDF on word level: 0.7132


In [None]:
print('SVM Accuracy with TF-IDF on N-gram level: %0.4f' % train_model(svm.SVC(kernel = 'linear'), xtrain_ngram, train_y, xtest_ngram, test_y, name = 'svcModel_ngram'))

SVM Accuracy with TF-IDF on N-gram level: 0.6262

In [None]:
print('SVM Accuracy with word embedding: %0.4f' % train_model(svm.SVC(kernel = 'linear'), xtrain_embedding, train_y, xtest_embedding, test_y, name = 'svcEmbedding'))

#### Neural network


To prepare the input data for the neural network, we first examine the vocabulary size and the maximum number of tokens per description in the training data, and then pad the sequences to a fixed length with the Keras function *pad_sequences* with *maxlen* = $200$.

In [36]:
len(tfidf_vect.vocabulary_)

12618

In [37]:
max([len(doc) for doc in xtrain_tokens])

136

In [102]:
from keras.preprocessing.text import Tokenizer
tokenizerKeras= Tokenizer(num_words= 1000, filters='!"#$&()*+,-./:;<=>?@[\]^_`{|}~', lower=True, split=' ', char_level=False, oov_token=None)

In [103]:
tokenizerKeras.fit_on_texts(xtrain_tokens)

In [104]:
xtrain_seq = sequence.pad_sequences(tokenizerKeras.texts_to_sequences(xtrain_tokens), maxlen=200,  dtype='int32', padding='pre', truncating='pre', value=0.0)

In [105]:
xtest_seq = sequence.pad_sequences(tokenizerKeras.texts_to_sequences(xtest_tokens), maxlen=200,  dtype='int32', padding='pre', truncating='pre', value=0.0)

In [106]:
xtrain_seq.shape

(70163, 200)
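
The behaviour selected by `padding='pre'` and `truncating='pre'` can be sketched in plain Python: sequences shorter than *maxlen* are filled with zeros on the left, and longer ones keep only their last *maxlen* entries. The helper `pad_pre` is a hypothetical re-implementation for illustration, not the Keras function, and assumes a positive *maxlen*.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items and left-pad with value up to maxlen."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

Since the longest training description has 136 tokens, a *maxlen* of 200 pads every training sequence and truncates none.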

In [107]:
from keras.utils import to_categorical

In [108]:
train_y_encoded = to_categorical(train_y)
test_y_encoded = to_categorical(test_y)

In [109]:
train_y_encoded.shape

(70163, 25)
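
One-hot encoding turns each integer label into a vector with a single 1 at the label's index, which is the target format expected by the categorical cross-entropy loss. A minimal plain-Python sketch of the idea (the helper `one_hot` is hypothetical and only illustrates what the encoding produces):

```python
def one_hot(labels, num_classes):
    """Return one row per label with 1.0 at the label index and 0.0 elsewhere."""
    rows = []
    for y in labels:
        row = [0.0] * num_classes
        row[y] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```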

In [111]:
model = Sequential()

# Embedding -> Flatten -> Dense(64, relu) -> Dense(25, softmax)
model.add(Embedding(input_dim = 20000, output_dim = 64, input_length = 200))
model.add(Flatten())
# input_shape is only needed on the first layer; later layers infer it
model.add(Dense(64, activation='relu'))
model.add(Dense(25, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(xtrain_seq, train_y_encoded, epochs=3, batch_size=64)
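
To get a feeling for the size of this network, the number of trainable parameters can be worked out by hand from the layer sizes used above. This is a back-of-the-envelope sketch, assuming each dense layer has one bias per unit; the variable names are illustrative.

```python
vocab_size, embed_dim, seq_len = 20000, 64, 200
hidden_units, num_classes = 64, 25

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim
# Flatten has no parameters; it concatenates seq_len vectors of size embed_dim.
flatten_width = seq_len * embed_dim
# Dense layers: weights plus one bias per unit.
dense1_params = flatten_width * hidden_units + hidden_units
dense2_params = hidden_units * num_classes + num_classes

total_params = embedding_params + dense1_params + dense2_params
print(embedding_params, dense1_params, dense2_params, total_params)
```

Most of the parameters sit in the embedding table and in the first dense layer, whose input width grows linearly with the padded sequence length.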

#### Final model trained on all data points

In [None]:
tfidf_vect.fit(description_list)

In [None]:
x_tfidf =  tfidf_vect.transform(description_list)

In [None]:
clf = svm.SVC(kernel = 'linear')

In [None]:
clf.fit( x_tfidf, variety_list)

In [None]:
filename = '/home/yiwei/yiwei_data/SVMmodel.sav'
with open(filename, 'wb') as file:
    pickle.dump(clf, file)
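
Since the classifier is trained on integer labels, its predictions have to be mapped back to variety names before they are shown to a user. A minimal sketch, using three entries of the `label` mapping from above and a hypothetical list of predicted ids standing in for the output of `clf.predict`:

```python
# Subset of the `label` mapping shown earlier in the notebook.
label_subset = {"Pinot Noir": 0, "Chardonnay": 1, "Cabernet Sauvignon": 2}

# Invert the mapping: integer id -> variety name.
id_to_variety = {idx: name for name, idx in label_subset.items()}

# Hypothetical predictions standing in for clf.predict(...).
predicted_ids = [1, 0, 2]
predicted_names = [id_to_variety[i] for i in predicted_ids]
print(predicted_names)
```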