## What is GloVe (Global Vectors for Word Representation)?

https://nlp.stanford.edu/projects/glove/

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

http://nonbiri-tereka.hatenablog.com/entry/2015/10/25/223430

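GloVe stands for Global Vectors for Word Representation. As the name suggests, it computes global feature vectors that represent words. Each word is mapped to a D-dimensional vector, which makes it possible to compute distances between words.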
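To make "distance between words" concrete, here is a minimal sketch using cosine similarity, one common way to compare word vectors. The three-dimensional vectors are invented for illustration and are not real GloVe values.

```python
import numpy as np

# Toy 3-dimensional vectors, invented for illustration (not real GloVe values).
toy_vectors = {
    "salt":   np.array([0.9, 0.1, 0.0]),
    "pepper": np.array([0.8, 0.2, 0.1]),
    "sugar":  np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_salt_pepper = cosine_similarity(toy_vectors["salt"], toy_vectors["pepper"])
sim_salt_sugar = cosine_similarity(toy_vectors["salt"], toy_vectors["sugar"])
print(sim_salt_pepper, sim_salt_sugar)
```

With these made-up numbers, "salt" ends up closer to "pepper" than to "sugar"; with real GloVe vectors the same function applies unchanged.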

In [17]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

#from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [4]:
import json
recipeRaw = pd.read_json("./data/train.json")
recipeRaw["ingredientsFlat"] = recipeRaw["ingredients"].apply(lambda x: ' '.join(x))
recipeRaw.head()

| | id | cuisine | ingredients | ingredientsFlat |
|---|---|---|---|---|
| 0 | 10259 | greek | [romaine lettuce, black olives, grape tomatoes... | romaine lettuce black olives grape tomatoes ga... |
| 1 | 25693 | southern_us | [plain flour, ground pepper, salt, tomatoes, g... | plain flour ground pepper salt tomatoes ground... |
| 2 | 20130 | filipino | [eggs, pepper, salt, mayonaise, cooking oil, g... | eggs pepper salt mayonaise cooking oil green c... |
| 3 | 22213 | indian | [water, vegetable oil, wheat, salt] | water vegetable oil wheat salt |
| 4 | 13162 | indian | [black pepper, shallots, cornflour, cayenne pe... | black pepper shallots cornflour cayenne pepper... |


In [6]:
recipeRawTest = pd.read_json("./data/test.json")
recipeRawTest["ingredientsFlat"] = recipeRawTest["ingredients"].apply(lambda x: ' '.join(x))
testdocs = recipeRawTest["ingredientsFlat"].values
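# DataFrame.append was removed in pandas 2.0; pd.concat with sort=True keeps the alphabetically sorted columns
recipeRawCombined = pd.concat([recipeRaw, recipeRawTest], sort=True)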
recipeRawCombined[40000:].head()


| | cuisine | id | ingredients | ingredientsFlat |
|---|---|---|---|---|
| 226 | | 14989 | [garlic cloves, chicken, dry white wine, fresh... | garlic cloves chicken dry white wine fresh bas... |
| 227 | | 13270 | [egg whites, all-purpose flour, eggs, almond e... | egg whites all-purpose flour eggs almond extra... |
| 228 | | 25529 | [tomato sauce, refrigerated pizza dough, olive... | tomato sauce refrigerated pizza dough olive oi... |
| 229 | | 12140 | [kosher salt, whole milk yoghurt, heavy cream,... | kosher salt whole milk yoghurt heavy cream cay... |
| 230 | | 23127 | [avocado, pork, lime, flour, shredded sharp ch... | avocado pork lime flour shredded sharp cheddar... |


In [7]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(recipeRaw["cuisine"].values)
list(le.classes_)

['brazilian',
 'british',
 'cajun_creole',
 'chinese',
 'filipino',
 'french',
 'greek',
 'indian',
 'irish',
 'italian',
 'jamaican',
 'japanese',
 'korean',
 'mexican',
 'moroccan',
 'russian',
 'southern_us',
 'spanish',
 'thai',
 'vietnamese']
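The class list above is in alphabetical order, and each cuisine's position in that list becomes its integer id. The following NumPy-only sketch illustrates that idea on made-up labels; it is an illustration of the mapping, not scikit-learn's implementation.

```python
import numpy as np

toy_cuisines = np.array(["greek", "indian", "greek", "filipino"])  # made-up labels

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each original label in that sorted array.
toy_classes, toy_ids = np.unique(toy_cuisines, return_inverse=True)
print(toy_classes.tolist())
print(toy_ids.tolist())

# Mapping ids back to names is plain indexing into the sorted class array.
restored = toy_classes[toy_ids]
```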

In [8]:
from keras.utils import to_categorical  # keras.utils.np_utils is not available in newer Keras releases
docs = recipeRaw["ingredientsFlat"].values
testdocs = recipeRawTest["ingredientsFlat"].values
docsCombined = recipeRawCombined["ingredientsFlat"].values
labels_enc = le.transform(recipeRaw["cuisine"].values)
labels = to_categorical(labels_enc)
labels

Using TensorFlow backend.


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
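The printed array looks all-zero only because NumPy abbreviates the middle columns; every row should contain exactly one 1. As a rough sketch of what one-hot encoding does, the helper below (a hypothetical `one_hot`, not the Keras function) builds the same kind of matrix from the ids of the first three training rows, using the class order listed earlier (greek = 6, southern_us = 16, filipino = 4).

```python
import numpy as np

def one_hot(label_ids, num_classes):
    """Return a (len(label_ids), num_classes) float32 one-hot matrix."""
    return np.eye(num_classes, dtype="float32")[np.asarray(label_ids)]

# Ids of the first three training rows: greek, southern_us, filipino.
example_ids = [6, 16, 4]
encoded = one_hot(example_ids, num_classes=20)
print(encoded.shape)
print(encoded.argmax(axis=1))  # recovers the integer ids
```

Because the 1s sit at columns 6, 16 and 4, they fall inside the `...` of an abbreviated printout.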

In [9]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docsCombined)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
encoded_test_docs = t.texts_to_sequences(testdocs)
print(vocab_size)
# pad documents to a max length of 40 tokens
max_length = 40
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
padded_test_docs = pad_sequences(encoded_test_docs, maxlen=max_length, padding='post')
print(len(padded_docs))

3185
39774
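The `+ 1` in `vocab_size` is there because the tokenizer numbers words from 1, leaving id 0 free for padding. The sketch below mimics post-padding with a hypothetical `pad_post` helper on made-up ids. It assumes that over-long sequences lose their leading ids, which to my understanding matches the `pad_sequences` default (`truncating='pre'`).

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` ids if too long."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        kept = list(seq)[-maxlen:]      # drop leading ids when the sequence is too long
        out[row, :len(kept)] = kept     # ids go first, padding stays at the end
    return out

toy_sequences = [[5, 2], [7, 1, 3, 9, 4, 8]]   # made-up word ids
padded = pad_post(toy_sequences, maxlen=4)
print(padded)
```

In the notebook, `maxlen` is 40, so every recipe becomes a row of exactly 40 ids regardless of how many ingredient tokens it has.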


In [26]:
# didn't work

# load the whole embedding into memory
#embeddings_index = dict()
#f = open('./data/glove.6B.50d.txt')
#for line in f:
#    values = line.split()
#    word = values[0]
#    coefs = np.asarray(values[1:], dtype='float32')
#    embeddings_index[word] = coefs
#f.close()
#print('Loaded %s word vectors.' % len(embeddings_index))

In [23]:
# Load the GloVe dictionary --> adapted from a Jigsaw Toxic Comments notebook
# https://pycarnival.com/dict/

path = './data/'
EMBEDDING_FILE=f'{path}glove.6B.100d.txt'
# GloVe word vectors (https://www.kaggle.com/watts2/glove6b50dtxt)

def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
# A single star * unpacks a sequence/collection into positional arguments.
# Defining a function with a "*x" parameter lets it accept any number of extra positional arguments without declaring each one.
# https://codeday.me/jp/qa/20181122/11424.html

# the with-block closes the file once the dictionary has been built
with open(EMBEDDING_FILE, encoding="utf-8_sig") as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)
# split() with no arguments splits the line on whitespace and returns a list

# To avoid "UnicodeDecodeError: 'cp932' codec can't decode byte 0x93 in position 3136: illegal multibyte sequence",
# open(EMBEDDING_FILE) --> open(EMBEDDING_FILE, encoding="utf-8_sig")
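For readers unfamiliar with the file layout, each line of a GloVe text file holds a token followed by its coefficients, separated by spaces. The sketch below parses two made-up lines from an in-memory file; `parse_glove_line` and the values are illustrative assumptions, not part of the original notebook.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its coefficients.
toy_file = io.StringIO("salt 0.1 -0.2 0.3\npepper 0.4 0.5 -0.6\n")

def parse_glove_line(line):
    """Split one line into (word, float32 vector)."""
    word, *coefs = line.rstrip().split(" ")
    return word, np.asarray(coefs, dtype="float32")

toy_index = dict(parse_glove_line(line) for line in toy_file)
print(len(toy_index), toy_index["salt"].shape)
```

The star in `word, *coefs` plays the same role as `*arr` in `get_coefs`: the first field becomes the word and everything else is collected into the vector.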

In [24]:
vocab = pd.DataFrame.from_dict(t.word_index,orient="index")
vocab.drop([0],axis=1).reset_index().rename(columns={"index":"word"}).to_csv("vocab.csv",index=False)

In [25]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [27]:
print(embedding_matrix.shape)

(3185, 100)
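Words that are absent from the GloVe dictionary keep an all-zero row in the matrix, so it can be worth checking how much of the vocabulary is covered. A minimal sketch with made-up data follows; the word index, the vectors, and the choice of "mayonaise" as the missing word are all hypothetical.

```python
import numpy as np

toy_word_index = {"salt": 1, "pepper": 2, "mayonaise": 3}   # made-up tokenizer index
toy_embeddings = {"salt": np.ones(4, dtype="float32"),
                  "pepper": np.full(4, 2.0, dtype="float32")}

# Row 0 is reserved for padding, hence the +1.
toy_matrix = np.zeros((len(toy_word_index) + 1, 4))
missing = []
for word, i in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is None:
        missing.append(word)        # no pretrained vector: row stays all zeros
    else:
        toy_matrix[i] = vector

coverage = 1 - len(missing) / len(toy_word_index)
print(f"coverage: {coverage:.0%}, missing: {missing}")
```

Running the same loop over `t.word_index` and `embeddings_index` would show how many of the 3,184 vocabulary words actually received a pretrained vector.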


In [28]:
from keras.models import Sequential
# all layers are importable from keras.layers; the submodule paths were dropped in newer Keras releases
from keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Embedding
from sklearn.model_selection import KFold

# fix random seed for reproducibility
seed = 42
np.random.seed(seed)

# define 5-fold cross-validation test harness
kfold = KFold(n_splits=5, shuffle=True, random_state=seed)

cvscores = []
for train, test in kfold.split(padded_docs, labels):
    # define the model

    model = Sequential()
    model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=40, trainable=False))
    model.add(Conv1D(filters=100, kernel_size=3, padding='same', activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(250, activation='relu'))
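    # each recipe has exactly one cuisine, so use softmax with categorical cross-entropy
    # (sigmoid + binary_crossentropy treats the 20 outputs as independent labels and inflates 'acc')
    model.add(Dense(le.classes_.size, activation='softmax'))
    # compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])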
    # summarize the model
    if not cvscores:  # first fold only
        model.summary()  # summary() prints the table itself and returns None
    # fit the model
    model.fit(padded_docs[train], labels[train], epochs=5, verbose=0)
    scores = model.evaluate(padded_docs[test], labels[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 40, 100)           318500    
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 40, 100)           30100     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 20, 100)           0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 2000)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               500250    
_________________________________________________________________
dense_2 (Dense)              (None, 20)                5020      
Total params: 853,870
Trainable params: 535,370
Non-trainable params: 318,500
________________________________________________________________
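The parameter counts in the summary can be checked by hand. The arithmetic below uses only the sizes that appear in the notebook (vocabulary 3185, embedding width 100, sequence length 40, 100 filters of width 3, pool size 2, 250 dense units, 20 classes).

```python
vocab_size, embed_dim, seq_len = 3185, 100, 40
filters, kernel_size, pool_size = 100, 3, 2
dense_units, num_classes = 250, 20

embedding_params = vocab_size * embed_dim                    # frozen GloVe weights
conv_params = kernel_size * embed_dim * filters + filters    # kernels + biases
flat_size = (seq_len // pool_size) * filters                 # 20 steps x 100 filters
dense1_params = flat_size * dense_units + dense_units        # weights + biases
dense2_params = dense_units * num_classes + num_classes      # weights + biases

trainable = conv_params + dense1_params + dense2_params
total = embedding_params + trainable
print(embedding_params, conv_params, dense1_params, dense2_params, trainable, total)
```

The non-trainable count equals the embedding layer alone because it was created with `trainable=False`.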

In [29]:
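# note: `model` here is the network trained on the last cross-validation fold
predictions = model.predict(padded_test_docs)
print(predictions.shape)
# take the highest-scoring class for each recipe and map it back to its cuisine name
recipeRawTest["cuisine"] = [le.classes_[np.argmax(prediction)] for prediction in predictions]
recipeRawTest.head()
# keep only id and cuisine for the submission file
recipeRawTest.drop(["ingredients","ingredientsFlat"],axis=1).to_csv("GloVe_submission.csv",index=False)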

(9944, 20)
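The prediction matrix has one row per test recipe and one column per cuisine. The list comprehension above can also be written as a single vectorized `argmax` over the class axis, sketched here with made-up scores and a three-class subset of the label names.

```python
import numpy as np

toy_classes = np.array(["greek", "indian", "italian"])     # subset of the label names
toy_predictions = np.array([[0.1, 0.7, 0.2],
                            [0.6, 0.3, 0.1]])             # made-up scores, one row per recipe

# argmax along axis 1 gives the winning column per row; indexing maps it to a name.
toy_result = toy_classes[toy_predictions.argmax(axis=1)]
print(toy_result.tolist())
```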
