In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function


In [2]:
#import the important modules

import itertools
import os #operating system
import math #math operations
import numpy as np #arrays
import pandas as pd #dataframes
import tensorflow as tf #dataflow programming

from sklearn.preprocessing import LabelEncoder
import keras

layers=keras.layers

print("You have tensorflow version " , tf.__version__)

Using TensorFlow backend.


You have tensorflow version  1.7.0


With TensorFlow 1.12 you might see the accuracy in the final output reported as `acc: 0.0000e+00`. I had to reinstall version 1.7, and then it worked.

Get the data from the source:
URL = "https://storage.googleapis.com/sara-cloud-ml/wine_data.csv"


In [3]:
#get the data from the source

path = 'wine_data.csv'
print(path)

wine_data.csv


In [4]:
#load the CSV file into a pandas DataFrame
data = pd.read_csv(path)

In [5]:
#shuffle the data
data = data.sample(frac=1)

#print first 5 rows
data.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
74046,74046,Spain,"This is gritty, burnt and stalky smelling. On ...",,81,10.0,Central Spain,Vino de la Tierra de Castilla,,Tempranillo-Shiraz,Volteo
67803,67803,Italy,The strong start—a nose full of slightly candi...,Pratale,84,19.0,Tuscany,Chianti Classico,,Sangiovese,Coli
57195,57195,US,Greenwood Ridge often does a good job at Sauvi...,,80,18.0,California,Anderson Valley,Mendocino/Lake Counties,Sauvignon Blanc,Greenwood Ridge
133375,133375,US,A light entry holds flavors of strawberry and ...,Renegade Ridge Estate,85,75.0,Oregon,Dundee Hills,Willamette Valley,Pinot Noir,Archery Summit
218,218,France,Paradis is a parcel within the Pfingstberg Gra...,Pfingstberg Paradis Grand Cru,93,42.0,Alsace,Alsace,,Riesling,Domaine François Schmitt


In [6]:
#do some preprocessing to limit the number of varieties in the dataset

data = data[pd.notnull(data['country'])]
data = data[pd.notnull(data['price'])]
data = data.drop(data.columns[0],axis=1) # drop the first column (axis=1 means columns)

variety_threshold = 500 #any variety with 500 or fewer reviews will be removed
value_counts = data['variety'].value_counts() #number of reviews per variety name
to_remove=value_counts[value_counts<=variety_threshold].index 
data.replace(to_remove,np.nan,inplace=True) #replace the rare varieties with NaN
data=data[pd.notnull(data['variety'])] #drop the rows whose variety is now NaN
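
A more direct alternative to the replace-then-drop approach above is a boolean mask with `isin`, which only touches the `variety` column. This is a minimal sketch on made-up toy data; the frame `toy` and the threshold of 2 are assumptions for illustration, standing in for the real data and `variety_threshold = 500`.

```python
import pandas as pd

# Toy data standing in for the wine reviews (hypothetical values).
toy = pd.DataFrame({
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir",
                "Riesling", "Riesling", "Malbec"],
    "price": [30.0, 45.0, 25.0, 18.0, 22.0, 15.0],
})

toy_threshold = 2  # stands in for variety_threshold = 500

counts = toy["variety"].value_counts()          # reviews per variety
rare = counts[counts <= toy_threshold].index    # varieties at or below the threshold
filtered = toy[~toy["variety"].isin(rare)]      # keep only the common varieties

print(filtered["variety"].unique())
```

Only the variety with more than `toy_threshold` rows survives the filter.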

In [7]:
#split the dataset into train and test
train_size=int(len(data)*.8)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(data)-train_size))


Train size: 95646
Test size: 23912


Extract the training and testing features and labels.


In [8]:

#Train features
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]

#train labels
labels_train=data['price'][:train_size]

#test features
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]

#test labels
labels_test=data['price'][train_size:]

Instead of looking at every word in the descriptions, let's limit the vocabulary to the top 12,000 words, which can be done with the built-in Keras `Tokenizer`.

This representation is considered "wide" because the model's input for each description will be a 12,000-element vector of zeros and ones indicating whether each word in our vocabulary is present in that description.
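
To make the zeros-and-ones vector concrete, here is a simplified NumPy sketch of a binary bag-of-words encoding. It is not the Keras implementation (which differs in details such as punctuation handling and index layout); the helper `binary_bow` and the four-word vocabulary are assumptions for illustration.

```python
import numpy as np

def binary_bow(texts, vocab):
    """Return a (len(texts), len(vocab)) matrix of 0/1 word-presence flags."""
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.zeros((len(texts), len(vocab)), dtype=np.float32)
    for row, text in enumerate(texts):
        for word in text.lower().split():
            col = index.get(word)
            if col is not None:          # words outside the vocabulary are ignored
                matrix[row, col] = 1.0   # presence only, repeats do not add up
    return matrix

toy_vocab = ["cherry", "oak", "dry", "sweet"]            # hypothetical vocabulary
toy_texts = ["Dry cherry and oak", "sweet sweet cherry"]  # hypothetical descriptions
bow = binary_bow(toy_texts, toy_vocab)
print(bow)
```

With the real data the matrix has one row per description and 12,000 columns instead of four.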


In [9]:

#create a tokenizer to preprocess our text descriptions
vocab_size = 12000 #this is a hyperparameter

tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size,char_level=False)

tokenize.fit_on_texts(description_train) #only fit on train


Now that this is done, we'll use the `texts_to_matrix` function
to convert each description to a bag-of-words vector.

In [10]:
#Wide feature 1 : sparse bag of words (bow) vocab_size vector

description_bow_train = tokenize.texts_to_matrix(description_train)
description_bow_test = tokenize.texts_to_matrix(description_test)


In [11]:
#wide feature 2 : one hot vector for variety categories

#use sklearn utility to convert label strings to numbered index

encoder = LabelEncoder()
encoder.fit(variety_train)
variety_train = encoder.transform(variety_train)
variety_test = encoder.transform(variety_test)
num_classes = np.max(variety_train) +1 

#convert labels into one hot

variety_train = keras.utils.to_categorical(variety_train,num_classes)
variety_test = keras.utils.to_categorical(variety_test,num_classes)
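
The two steps in the cell above (string label to integer index, then index to one-hot vector) can be sketched in plain NumPy. This is an illustration under assumptions, not the scikit-learn or Keras code; the helper `one_hot` and the toy variety list are hypothetical.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn integer class indices into rows with a single 1.0 each."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

toy_varieties = ["Riesling", "Pinot Noir", "Riesling", "Malbec"]  # hypothetical labels

classes = sorted(set(toy_varieties))                 # sorted class names
to_index = {name: i for i, name in enumerate(classes)}
indices = np.array([to_index[v] for v in toy_varieties])

encoded = one_hot(indices, len(classes))
print(classes)
print(encoded)
```

In the notebook the same idea yields one 40-element vector per wine, since 40 varieties remain after filtering.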


Keras has two APIs for building a model: the Sequential API and the Functional API. I am going to use the Functional API.

It provides more flexibility and lets us combine multiple inputs in one model. 
It also lets us combine our wide and deep models into one. 
First we need to define an input layer as a 12,000-element vector (one element per vocabulary word). Then I will concatenate it with the variety input and connect the result to dense layers that generate the price prediction.



In [12]:
#Define our wide model with functional api

bow_inputs = layers.Input(shape=(vocab_size,))
variety_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([bow_inputs,variety_inputs])
merged_layer = layers.Dense(256,activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)
wide_model =keras.Model(inputs=[bow_inputs,variety_inputs],outputs=predictions)


 bow_inputs - Tensor("input_20:0", shape=(?, 12000), dtype=float32) <br>
 variety_inputs - Tensor("input_21:0", shape=(?, 40), dtype=float32) <br>
 merged_layer - Tensor("dense_20/Relu:0", shape=(?, 256), dtype=float32)<br>
 prediction - Tensor("dense_21/BiasAdd:0", shape=(?, 1), dtype=float32)
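
The shapes above also determine how many weights each dense layer holds: one weight per input-unit pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical name used only for this sketch; the sizes 12000, 40 and 256 come from the cells above.

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

merged_width = 12000 + 40                      # bag-of-words inputs + one-hot variety inputs
hidden_params = dense_param_count(merged_width, 256)
output_params = dense_param_count(256, 1)

print(merged_width, hidden_params, output_params)
```

The hidden-layer figure should agree with the `Param #` that `wide_model.summary()` reports for the 256-unit dense layer.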


In [13]:
#now print out the summary from wide model

wide_model.compile(loss='mse' , optimizer='adam' ,metrics=['accuracy'])
print(wide_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 12040)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 256)          3082496     concatenate_1[0][0]              
__________

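See https://keras.io/models/model/ for details on `.compile()`. <br>
`loss='mse'` is the mean squared error. Note that `accuracy` is a classification metric, so it is not very informative for a regression target like price.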
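
As a small worked example of the loss, mean squared error averages the squared differences between actual and predicted prices. The function `mse` and the toy prices below are assumptions for illustration, not values from the dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

toy_actual = [20.0, 55.0, 38.0]      # hypothetical prices
toy_predicted = [24.0, 55.0, 34.0]   # hypothetical model outputs

toy_loss = mse(toy_actual, toy_predicted)
print(toy_loss)   # (16 + 0 + 16) / 3
```

Because the errors are squared, a few badly mispriced wines can dominate the loss.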


To create a deep representation of the wine descriptions, we'll represent them as embeddings. <br>
There are lots of resources on word embeddings, but the short version is that they map words to vectors so that similar words are closer together in the vector space. <br>
To feed the text descriptions to an embedding layer, we first need to convert each description to a vector of integers corresponding to the words in our vocabulary. <br>
We can do this with the Keras `texts_to_sequences` method, and we'll use `pad_sequences` to add zeros to each description vector so 
 that they are all the same length.
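
The padding step can be sketched in plain Python. This assumes the same settings as the cell below (`padding="post"`) together with the Keras default of truncating over-long sequences from the front; `pad_post` is a hypothetical helper, not the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                          # truncate from the front
        padded.append(seq + [value] * (maxlen - len(seq)))  # pad at the end
    return padded

toy_sequences = [[5, 12, 7], [3], [9, 4, 8, 2, 6, 1]]  # hypothetical word indices
toy_padded = pad_post(toy_sequences, maxlen=4)
print(toy_padded)
```

Every row now has the same length, which is what the embedding layer's fixed input shape requires.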


In [14]:
#Deep model feature: word embeddings of wine description

train_embed = tokenize.texts_to_sequences(description_train)
test_embed = tokenize.texts_to_sequences(description_test)

max_seq_length = 170

train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post")
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post")


Now we are ready to create our embedding layer and feed it to the deep model.
There are two ways to create an embedding layer: we can use weights from pretrained embeddings, or we can
learn the embeddings from our own vocabulary. 
It's best to experiment with both and see which performs better. Here I am using learned embeddings. 
First I will define the shape of the input to our deep model and then feed it to the embedding layer. I am using an embedding layer with **8 dimensions**, and the output
of the embedding layer will be a **3-dimensional tensor** of shape (batch size, sequence length, 8).
In order to connect our embedding layer to the fully connected dense output layer, we need to flatten it first.
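
The shape changes described above can be traced with NumPy alone. In this sketch an embedding is treated as a lookup table with one 8-element row per vocabulary index; the random table and the fake batch are assumptions for illustration, not trained weights.

```python
import numpy as np

vocab_size, embed_dim, max_seq_length = 12000, 8, 170

rng = np.random.default_rng(0)
# Random stand-in for the learned embedding weights.
embedding_table = rng.normal(size=(vocab_size, embed_dim)).astype(np.float32)

# Two fake padded descriptions, each a row of word indices.
batch = rng.integers(0, vocab_size, size=(2, max_seq_length))

embedded = embedding_table[batch]               # lookup: (2, 170, 8)
flattened = embedded.reshape(len(batch), -1)    # flatten: (2, 1360)

embedding_params = vocab_size * embed_dim       # one 8-vector per vocabulary index
dense_params = flattened.shape[1] * 1 + 1       # weights plus bias for Dense(1)
print(embedded.shape, flattened.shape, embedding_params + dense_params)
```

The flattened width is 170 * 8 = 1360, which is the input size of the final dense layer.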


In [15]:

# define the deep model and flatten the embedding output

deep_inputs = layers.Input(shape=(max_seq_length,))
embedding = layers.Embedding(vocab_size,8,input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
embed_out = layers.Dense(1)(embedding)
deep_model = keras.Model(inputs=deep_inputs,outputs=embed_out)
print(deep_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 170)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 170, 8)            96000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 1360)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 1361      
Total params: 97,361
Trainable params: 97,361
Non-trainable params: 0
_________________________________________________________________
None


In [16]:
deep_model.compile(loss='mse' , optimizer='adam' ,metrics=['accuracy'])

Next, create a layer that concatenates the outputs of the two models, feed it into a fully connected dense layer, 
and finally define a combined model that ties together the inputs of both models and the merged output.
Since both models are predicting the same thing (price), the outputs and the labels are the same for both.
Also, since the output of our model is a numeric value, the labels don't need any preprocessing; they are already 
in the right format.
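
Numerically, concatenating the two single-value outputs and passing them through a one-unit dense layer amounts to a weighted sum plus a bias. The sketch below uses made-up outputs and made-up weights purely for illustration; in the real model these weights are learned during training.

```python
import numpy as np

wide_out = np.array([[22.0], [60.0]])   # hypothetical wide-model outputs, shape (2, 1)
deep_out = np.array([[26.0], [50.0]])   # hypothetical deep-model outputs, shape (2, 1)

merged = np.concatenate([wide_out, deep_out], axis=1)   # shape (2, 2)

kernel = np.array([[0.5], [0.5]])   # made-up weights: equal trust in both models
bias = np.array([0.0])              # made-up bias

combined = merged @ kernel + bias   # shape (2, 1), one price per wine
print(combined)
```

The final layer therefore adds only three parameters (two weights and one bias) on top of the two sub-models.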


In [17]:
# The wide and deep models are defined, so now we combine them


merged_out = layers.concatenate([wide_model.output,deep_model.output])
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model(wide_model.input +[deep_model.input],merged_out)
print(combined_model.summary())

combined_model.compile(loss='mse' , optimizer='adam' ,metrics=['accuracy'])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 170)          0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 12040)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________

In [18]:
# Training

combined_model.fit([description_bow_train,variety_train]+[train_embed],labels_train, epochs=10 , batch_size = 64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a5dc4ecf60>

In [19]:
# Evaluate

combined_model.evaluate([description_bow_test,variety_test]+[test_embed],labels_test,batch_size = 64)



[525.3722583289607, 0.0687102709986287]

The loss decreased a lot during training and the reported accuracy increased too.
For a first run with only 10 epochs the result is quite good, and more training may improve it further.

In [20]:
# Generate predictions

predictions = combined_model.predict([description_bow_test, variety_test] + [test_embed])

In [21]:
# Compare predictions with actual values for the first few items in our test dataset
num_predictions = 40
diff = 0

for i in range(num_predictions):
    val = predictions[i]
    print(description_test.iloc[i])
    print('Predicted: ', val[0], 'Actual: ', labels_test.iloc[i], '\n')
    diff += abs(val[0] - labels_test.iloc[i])



This charming Sauvignon Blanc smells sweetly of fresh pears and honeydew with a sprinkling of powdered sugar. Full bodied but dry with concentrated white grapefruit flavors, its brightened by a nervy acidity and a refreshing minerality in the midpalate.
Predicted:  23.698996 Actual:  20.0 

Rich and sweet, with blackberry, chocolate, plum pudding and spice flavors, accented with fine acidity. Made from traditional Port varieties plus Petite Sirah, it's a bit watery. The score would soar if the fruity concentration were greater.
Predicted:  55.410152 Actual:  55.0 

Kudos to the winery for holding this estate grown wine back for five-plus years. The wine is now softly dry and enormously complex, showing blackberry and cherry liqueur flavors that are beginning to age into drier, earthier notes of dried fruits and minerals. Cofermentation with some Viognier adds the perfect touch of bright, citrusy acidity.
Predicted:  33.536213 Actual:  38.0 

Attractive wine, boasting a good balance bet

Ripe but basic on the nose, this has dusty apple and sweet, minerally notes that are not sugary. It feels a little blowsy, but there's ample acidic cut to frame the dry, flavors of melon, green banana and apple. This has a simple finish. A blend of 80% Xarello and 20% Riesling.
Predicted:  14.5296755 Actual:  15.0 

In 5–10 years, this rating may look conservative, but right now, this wine's tannins are just too rustic and tough to be certain of its future evolution. It's a massive wine overall, with brambly, briary fruit, tinged with clove, cedar and chocolate, and those drying, astringent tannins on the finish.
Predicted:  72.45181 Actual:  125.0 

This is a plush, meaty wine with a dark, almost purplish hue and remarkable persistence and presence on the palate. Black cherry, ripe blueberries, vanilla, roasted nut and spice appear nicely on the nose without overwhelming the natural fruit. It has a chewy, juicy feel with silky tannins and good length on the finish.
Predicted:  26.9223

In [22]:
# Compare the average difference between actual price and the model's predicted price
print('Average prediction difference: ', diff / num_predictions)



Average prediction difference:  7.250338554382324


### An average prediction difference of about 7.25 over the first 40 test wines is a good result.
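
The loop in cell 21 computes a mean absolute error, which can also be written in vectorized form. The sketch below uses only the five predicted/actual pairs that are fully visible in the output above, so its result will not match the 7.25 figure computed over 40 items. It also takes the square root of the reported test loss to express it in price units.

```python
import math
import numpy as np

# The five complete predicted/actual pairs visible in the printed output.
predicted = np.array([23.698996, 55.410152, 33.536213, 14.5296755, 72.45181])
actual = np.array([20.0, 55.0, 38.0, 15.0, 125.0])

mae = float(np.mean(np.abs(predicted - actual)))   # mean absolute error

reported_test_mse = 525.3722583289607   # first value returned by evaluate()
rmse = math.sqrt(reported_test_mse)     # root mean squared error, in price units

print(mae, rmse)
```

The single expensive wine (actual 125.0) dominates this small sample, which shows how sensitive both numbers are to a few high-priced bottles.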