Notes for tomorrow...

I think using word vectors from SpaCy with TensorFlow/Keras is a little too complex for this course. It isn't hard, just more work than we should expect from students, and too advanced for them without some deep learning and programming skills.

So I think we should drop SpaCy here and learn the embeddings instead, so we can use TF for the entire thing.

In [4]:
import numpy as np
import pandas as pd
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

In [5]:
review_data = pd.read_csv('../input/yelp_ratings.csv', index_col=0)
review_data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,I *adore* Travis at the Hard Rock's new Kelly ...,5.0,1
2,I have to say that this office really has it t...,5.0,1
3,Went in for a lunch. Steak sandwich was delici...,5.0,1
4,Today was my second out of three sessions I ha...,1.0,0


In [6]:
doc = nlp(review_data.iloc[2].text)

In [7]:
np.stack([token.vector for token in doc]).shape

(126, 300)

### Exercise: Get the word vectors

In [8]:
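# Collect a (tokens, 300) array of word vectors for each of the first 1,000 reviews,
# keeping at most the first 100 tokens per review
reviews = []
with nlp.disable_pipes():
    for idx, review in review_data[:1000].iterrows():
        reviews.append(np.stack([token.vector for token in nlp(review.text)[:100]]))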

In [12]:
a = np.array([len(each) for each in reviews])

In [13]:
np.percentile(a, 50)

91.0

In [95]:
embeddings = np.zeros((len(reviews), 100, 300))

In [96]:
# Left-pad with zeros: shorter reviews keep their vectors at the end of the sequence
for i, vectors in enumerate(reviews):
    embeddings[i, -len(vectors):] = vectors
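
The padding step above can be checked on toy data. The sketch below uses made-up 3-dimensional vectors and a maximum length of 4 (illustrative values, not the 300 and 100 used here) to show that slicing with a negative start left-pads: the zeros come first and each review's vectors sit at the end of its row.

```python
import numpy as np

# Toy stand-ins for the per-review arrays: 2 and 4 tokens, 3-dim vectors
# (hypothetical values, chosen only for illustration).
toy_reviews = [np.ones((2, 3)), np.full((4, 3), 2.0)]
max_len, dim = 4, 3

padded = np.zeros((len(toy_reviews), max_len, dim))
for i, vectors in enumerate(toy_reviews):
    # A negative slice start places the vectors at the end of the row,
    # leaving the leading positions as zeros.
    padded[i, -len(vectors):] = vectors

# First component of each timestep in the 2-token review
print(padded[0, :, 0])  # [0. 0. 1. 1.]
```

With this layout the real tokens are the last timesteps the LSTM sees before it produces its output.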

Turns out saving 100 word embeddings per review gets us a 10 GB array. I can 
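
A quick back-of-the-envelope check of that size, as a sketch. It assumes float64 storage (the `np.zeros` default, 8 bytes per value); `array_bytes` is a hypothetical helper written for this note, and the 10 GB figure presumably refers to the full dataset rather than the 1,000-review subset built above.

```python
def array_bytes(n_reviews, n_tokens=100, dim=300, bytes_per_value=8):
    """Size in bytes of a dense (n_reviews, n_tokens, dim) array."""
    return n_reviews * n_tokens * dim * bytes_per_value

# The 1,000-review subset, stored as float64
subset_bytes = array_bytes(1000)
print(subset_bytes / 1e9, "GB")

# Number of reviews at which a float64 array reaches 10 GB
reviews_for_10gb = 10e9 / array_bytes(1)
print(round(reviews_for_10gb), "reviews")

# Storing float32 instead would halve the footprint
print(array_bytes(1000, bytes_per_value=4) / 1e9, "GB")
```

Each review costs 100 × 300 × 8 = 240,000 bytes, so 1,000 reviews take about 0.24 GB and 10 GB corresponds to roughly 41,700 reviews.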

In [97]:
np.save('../input/embeddings.npy', embeddings)

### Exercise: Define the model

If you didn't already have word embeddings from SpaCy, you'd use an `Embedding` layer here. If you'd like to learn more about embeddings, see our Embeddings mini-course.
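
For context, an `Embedding` layer is essentially a trainable lookup table: each word is mapped to an integer index, and the layer returns the matching row of a weight matrix. The NumPy sketch below illustrates only the lookup step, with a made-up four-word vocabulary and random weights (all hypothetical). In Keras the matrix would be held and learned by `tf.keras.layers.Embedding` during training.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved for padding.
vocab = {"<pad>": 0, "great": 1, "food": 2, "horrible": 3}
embedding_dim = 5

rng = np.random.default_rng(0)
# One row per vocabulary entry. A real Embedding layer would learn these
# values during training; here they are random placeholders.
weights = rng.normal(size=(len(vocab), embedding_dim))

# A review becomes a sequence of integer ids...
token_ids = np.array([vocab[w] for w in ["great", "food"]])
# ...and the lookup is plain row indexing.
vectors = weights[token_ids]

print(vectors.shape)  # (2, 5)
```

The result has the same (tokens, dimensions) layout as the stacked SpaCy vectors, which is why an `Embedding` layer can sit in front of the LSTM in their place.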

In [44]:
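import tensorflow as tf

# Input: one review as a (timesteps, 300) sequence of word vectors
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # probability of positive sentiment
])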

In [55]:
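# Sanity check: pass one padded review, shape (1, 100, 300), through the untrained model.
# (`a` above holds review lengths, which is not a valid LSTM input.)
out = model(embeddings[:1].astype('float32'))
print(out.numpy())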

[[0.47405446]]


In [56]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [76]:
# x and y must have the same number of samples; train on the first 100 reviews
history = model.fit(x=embeddings[:100], y=review_data.sentiment[:100].values, epochs=10)

Train on 100 samples
Epoch 1/10
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Index(['text', 'stars', 'sentiment'], dtype='object')