To run this model directly in the browser with zero setup, open it in [Colab here](https://colab.research.google.com/github/sararob/keras-wine-model/blob/master/keras-wide-deep.ipynb).

In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

In [2]:
# Install the TensorFlow version this notebook was tested with (1.7.0)
!pip install -q -U tensorflow==1.7.0

ERROR: magenta 0.3.19 has requirement tensorflow>=1.12.0, but you'll have tensorflow 1.7.0 which is incompatible.


In [3]:
import itertools
import os
import math
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
layers = keras.layers

# This code was tested with TensorFlow v1.7
print("You have TensorFlow version", tf.__version__)


You have TensorFlow version 1.7.0


In [4]:
# Get the data: original source is here: https://www.kaggle.com/zynicide/wine-reviews/data
URL = "https://storage.googleapis.com/sara-cloud-ml/wine_data.csv"
path = tf.keras.utils.get_file(URL.split('/')[-1], URL)

Downloading data from https://storage.googleapis.com/sara-cloud-ml/wine_data.csv


In [0]:
# Convert the data to a Pandas data frame
data = pd.read_csv(path)

In [6]:
# Shuffle the data
data = data.sample(frac=1)

# Print the first 5 rows
data.head()

Unnamed: 0.1,Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
29283,29283,Italy,"Aromas of red berry, rose, toast and a whiff o...",Riserva,90,60.0,Piedmont,Gattinara,,Nebbiolo,Travaglini
124269,124269,Italy,"Compared to the estate's Vigna Cannubi cru, th...",Vigna Castellero,91,,Piedmont,Barolo,,Nebbiolo,Barale Fratelli
27952,27952,Spain,"Yellow in color, this blend of 80% Macabeo and...",Fuenteseca,84,10.0,Levante,Utiel-Requena,,White Blend,Sierra Norte
70097,70097,US,"Genesis is Hogue's upper-tier line, made in mu...",Genesis,87,15.0,Washington,Washington,Washington Other,Riesling,Hogue
88738,88738,US,This is a relatively light style of Washington...,,89,25.0,Washington,Columbia Valley (WA),Columbia Valley,Syrah,L'Ecole No. 41


In [0]:
# Do some preprocessing to limit the number of wine varieties in the dataset
data = data[pd.notnull(data['country'])]
data = data[pd.notnull(data['price'])]
data = data.drop(data.columns[0], axis=1)

variety_threshold = 500 # Varieties that occur this many times or fewer will be removed.
value_counts = data['variety'].value_counts()
to_remove = value_counts[value_counts <= variety_threshold].index
# Drop rows with a rare variety, filtering on the 'variety' column only
data = data[~data['variety'].isin(to_remove)]
data = data[pd.notnull(data['variety'])]
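
The thresholding step above can be illustrated on a toy list. This is a minimal plain-Python sketch, not the pandas code used in the notebook; the `varieties` list and the threshold of 2 are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data, not taken from the wine dataset
varieties = ["Syrah", "Syrah", "Syrah", "Riesling", "Riesling", "Nebbiolo"]
threshold = 2  # categories occurring this many times or fewer are dropped

# Count each category, collect the rare ones, then filter them out
counts = Counter(varieties)
to_remove = {name for name, n in counts.items() if n <= threshold}
kept = [v for v in varieties if v not in to_remove]
print(kept)
```

Note that the comparison is `<=`, so a category that occurs exactly `threshold` times is removed as well.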

In [8]:
# Split data into train and test
train_size = int(len(data) * .8)
print("Train size: %d" % train_size)
print("Test size: %d" % (len(data) - train_size))

Train size: 95646
Test size: 23912


In [0]:
# Train features
description_train = data['description'][:train_size]
variety_train = data['variety'][:train_size]

# Train labels
labels_train = data['price'][:train_size]

# Test features
description_test = data['description'][train_size:]
variety_test = data['variety'][train_size:]

# Test labels
labels_test = data['price'][train_size:]

In [0]:
# Create a tokenizer to preprocess our text descriptions
vocab_size = 12000 # This is a hyperparameter, experiment with different values for your dataset
tokenize = keras.preprocessing.text.Tokenizer(num_words=vocab_size, char_level=False)
tokenize.fit_on_texts(description_train) # only fit on train

In [0]:
# Wide feature 1: bag of words (bow) vector of length vocab_size, mostly zeros
description_bow_train = tokenize.texts_to_matrix(description_train)
description_bow_test = tokenize.texts_to_matrix(description_test)
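
To make the bag-of-words feature concrete, here is a rough plain-Python sketch of what a binary bag-of-words matrix looks like. It is an illustration of the idea, not the Keras implementation; the toy `docs`, the small `vocab_size`, and the helper `to_binary_row` are assumptions made for this example. It follows the Keras Tokenizer convention of ranking words by frequency and leaving index 0 unused.

```python
from collections import Counter

# Hypothetical toy corpus, not taken from the wine dataset
docs = ["dry and oaky", "dry and silky", "rich cherry flavors"]
vocab_size = 6  # index 0 is unused, so only the 5 most frequent words are kept

# Rank words by frequency; the most frequent word gets index 1
counts = Counter(word for doc in docs for word in doc.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_binary_row(doc):
    """Return a vocab_size-long row with 1 at the index of each kept word."""
    row = [0] * vocab_size
    for word in doc.split():
        idx = word_index.get(word)
        if idx is not None and idx < vocab_size:
            row[idx] = 1
    return row

bow = [to_binary_row(d) for d in docs]
for row in bow:
    print(row)
```

Words whose rank falls outside the vocabulary are silently ignored, which is why a larger `vocab_size` keeps more of the rare descriptive terms.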

In [0]:
# Wide feature 2: one-hot vector of variety categories

# Use sklearn utility to convert label strings to numbered index
encoder = LabelEncoder()
encoder.fit(variety_train)
variety_train = encoder.transform(variety_train)
variety_test = encoder.transform(variety_test)
num_classes = np.max(variety_train) + 1

# Convert labels to one hot
variety_train = keras.utils.to_categorical(variety_train, num_classes)
variety_test = keras.utils.to_categorical(variety_test, num_classes)
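
The two-step encoding above (string label to integer id, then id to one-hot vector) can be sketched in plain Python. This is only an illustration of the idea under toy assumptions: the `labels` list and the `one_hot` helper are hypothetical and do not come from the notebook.

```python
# Hypothetical toy labels, not taken from the wine dataset
labels = ["Syrah", "Riesling", "Syrah", "Nebbiolo"]

# Sorted unique classes get integer ids, similar in spirit to LabelEncoder
classes = sorted(set(labels))
class_to_id = {c: i for i, c in enumerate(classes)}

def one_hot(label):
    """Return a list with a single 1 at the position of the label's id."""
    row = [0] * len(classes)
    row[class_to_id[label]] = 1
    return row

encoded = [one_hot(label) for label in labels]
for label, row in zip(labels, encoded):
    print(label, row)
```

Each row has length equal to the number of classes, which is what `num_classes` provides as the input width of the variety branch.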

In [0]:
# Define our wide model with the functional API
bow_inputs = layers.Input(shape=(vocab_size,))
variety_inputs = layers.Input(shape=(num_classes,))
merged_layer = layers.concatenate([bow_inputs, variety_inputs])
merged_layer = layers.Dense(256, activation='relu')(merged_layer)
predictions = layers.Dense(1)(merged_layer)
wide_model = keras.Model(inputs=[bow_inputs, variety_inputs], outputs=predictions)

In [14]:
wide_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
print(wide_model.summary())

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 12040)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 256)          3082496     concatenate_1[0][0]              
__________

In [0]:
# Deep model feature: word embeddings of wine descriptions
train_embed = tokenize.texts_to_sequences(description_train)
test_embed = tokenize.texts_to_sequences(description_test)

max_seq_length = 170
train_embed = keras.preprocessing.sequence.pad_sequences(
    train_embed, maxlen=max_seq_length, padding="post")
test_embed = keras.preprocessing.sequence.pad_sequences(
    test_embed, maxlen=max_seq_length, padding="post")
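
The padding step can be sketched in plain Python as follows. This is a simplified illustration rather than the Keras code: `pad_post` is a hypothetical helper and the toy `sequences` are made up. It assumes the behavior of the call above, where padding is added at the end and over-long sequences are truncated from the front (the Keras default for `truncating`).

```python
def pad_post(seq, maxlen, value=0):
    """Pad at the end up to maxlen; keep only the last maxlen tokens if too long."""
    trimmed = seq[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))

# Hypothetical toy token-id sequences of different lengths
sequences = [[4, 9, 2], [7], [1, 2, 3, 4, 5, 6]]
padded = [pad_post(s, maxlen=4) for s in sequences]
for row in padded:
    print(row)
```

After this step every sequence has the same length, which is what lets the embedding layer take a fixed input shape of `(max_seq_length,)`.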

In [16]:
# Define our deep model with the Functional API
deep_inputs = layers.Input(shape=(max_seq_length,))
embedding = layers.Embedding(vocab_size, 8, input_length=max_seq_length)(deep_inputs)
embedding = layers.Flatten()(embedding)
embed_out = layers.Dense(1)(embedding)
deep_model = keras.Model(inputs=deep_inputs, outputs=embed_out)
print(deep_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 170)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 170, 8)            96000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 1360)              0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 1361      
Total params: 97,361
Trainable params: 97,361
Non-trainable params: 0
_________________________________________________________________
None


In [0]:
deep_model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

In [18]:
# Combine wide and deep into one model
merged_out = layers.concatenate([wide_model.output, deep_model.output])
merged_out = layers.Dense(1)(merged_out)
combined_model = keras.Model(wide_model.input + [deep_model.input], merged_out)
print(combined_model.summary())

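# Note: 'accuracy' is not a meaningful metric for a regression target like price;
# the MSE loss is the number to watch.
combined_model.compile(loss='mse',
                       optimizer='adam',
                       metrics=['accuracy'])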

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 12000)        0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 40)           0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 170)          0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 12040)        0           input_1[0][0]                    
                                                                 input_2[0][0]                    
__________

In [19]:
# Run training
combined_model.fit([description_bow_train, variety_train] + [train_embed], labels_train, epochs=10, batch_size=128)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras._impl.keras.callbacks.History at 0x7fbeb07389e8>

In [20]:
combined_model.evaluate([description_bow_test, variety_test] + [test_embed], labels_test, batch_size=128)



[497.65045942993777, 0.06912847106678703]
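
The first value returned above is the MSE loss on the test set, and the second is the `accuracy` metric, which carries little meaning for a regression problem. As a small worked step, taking the square root of the MSE gives an error in the same units as the price:

```python
import math

mse = 497.65045942993777  # first value returned by evaluate() above
rmse = math.sqrt(mse)     # root mean squared error, in price units
print("RMSE: %.2f" % rmse)
```

Because squaring penalizes large misses heavily, a few very expensive wines can dominate this figure.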

In [0]:
# Generate predictions
predictions = combined_model.predict([description_bow_test, variety_test] + [test_embed])

In [23]:
# Compare predictions with actual values for the first few items in our test dataset
num_predictions = 40
diff = 0

for i in range(num_predictions):
    val = predictions[i]
    print(description_test.iloc[i])
    print('Predicted: ', val[0], 'Actual: ', labels_test.iloc[i], '\n')
    diff += abs(val[0] - labels_test.iloc[i])

A pretty good bistro-style wine, dry and silky, with cherry, berry, currant, tobacco and spice flavors at an everyday price. It's a blend of Syrah and Petite Sirah.
Predicted:  10.488673 Actual:  12.0 

This lavishly oaked, opaque Zinfandel sports a dense nose of blackberry with pronounced toasty and menthol shadings. The mouthfeel is full and smooth, there's dark fruit and plenty of espresso, bitter chocolate and tar on the palate and finish, with tannins to lose on the back end. The wine may open up and the fruit ascend more in a year; lovers of heavy toast can eagerly drink it up now.
Predicted:  12.631305 Actual:  18.0 

Dry and oaky, this Chardonnay offers herb-inflected pineapple and pear flavors.
Predicted:  25.0863 Actual:  60.0 

Pretty rustic even for a Petite Sirah, with big, edgy tannins and a scoury mouthfeel. But it shows rich flavors of cherries, berries, cola, bacon and sandalwood, with a spicy, peppery finish. Ready to drink now.
Predicted:  29.00429 Actual:  28.0 

Th

In [24]:
# Compare the average difference between actual price and the model's predicted price
print('Average prediction difference: ', diff / num_predictions)

Average prediction difference:  7.428834748268128


In [0]:
from sklearn.metrics import r2_score

In [26]:
r2_score(labels_test, predictions)

0.5615770185633517

The R^2 is not particularly high.
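
For reference, R^2 compares the model's squared error with the squared error of always predicting the mean, so a score near 0.56 means roughly 56% of the variance in test prices is explained. The following is a minimal plain-Python sketch of that definition, not scikit-learn's implementation; the `r2` helper and the toy `predicted` values are assumptions made for illustration.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy example: a few prices and hypothetical predictions
actual = [12.0, 18.0, 60.0, 28.0]
predicted = [10.0, 20.0, 50.0, 30.0]
score = r2(actual, predicted)
print("R^2:", score)
```

A perfect model scores 1.0, and a model no better than predicting the mean scores 0.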