# Word Embeddings with Keras

This notebook uses the News Dataset from Kaggle (link in the README) to build 50-dimensional word embeddings with Keras.

In [14]:
import keras
from keras.models import Sequential
from keras.layers import Embedding
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
import re
import json
import nltk

Read the dataset. The file stores one JSON object per line, so each line is parsed into a dictionary and all the dictionaries are collected in a list.

In [7]:
data=[]
with open('News_data.json','r') as f: #The context manager closes the file after reading
    for line in f:
        data.append(json.loads(line)) #Each line holds one JSON object

In [8]:
data[:2]

[{'category': 'CRIME',
  'headline': 'There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV',
  'authors': 'Melissa Jeltsen',
  'link': 'https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89',
  'short_description': 'She left her husband. He killed their children. Just another day in America.',
  'date': '2018-05-26'},
 {'category': 'ENTERTAINMENT',
  'headline': "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song",
  'authors': 'Andy McDonald',
  'link': 'https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201',
  'short_description': 'Of course it has a song.',
  'date': '2018-05-26'}]

We will use only the headlines of the first 10,000 news articles to create the word embeddings.

In [9]:
data=data[:10000]

In [10]:
sent=[]
for news in data:
    x=re.sub('[^a-zA-Z]',' ',news['headline']) #Replace every non-letter character (digits, punctuation, apostrophes) with a space
    sent.append(x)

In [12]:
sent[5:10]

['Morgan Freeman  Devastated  That Sexual Harassment Claims Could Undermine Legacy',
 'Donald Trump Is Lovin  New McDonald s Jingle In  Tonight Show  Bit',
 'What To Watch On Amazon Prime That s New This Week',
 'Mike Myers Reveals He d  Like To  Do A Fourth Austin Powers Film',
 'What To Watch On Hulu That s New This Week']

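Set the vocabulary size to 10,000 and convert each word to an integer index. Despite its name, Keras's one_hot does not return one-hot vectors: it hashes every word to an integer between 1 and vocab_size - 1, so distinct words can occasionally share an index.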

In [16]:
vocab_size=10000

In [17]:
ohe_sent=[]
for i in sent:
    ohe_sent.append(one_hot(i,vocab_size))

In [20]:
print(ohe_sent[-1])
print(ohe_sent[0])

[675, 4873, 8559, 48, 8933, 1768, 6889, 9585, 2186, 675, 1051, 8590, 2991, 7955]
[1268, 1528, 4101, 88, 4144, 7287, 9607, 7211, 625, 3392, 797, 8933]
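The integers above come from hashing, not from a learned vocabulary. The sketch below illustrates the idea with a stable MD5-based hash; the helper names are hypothetical and the exact indices will differ from the notebook's, since Keras's `one_hot` uses Python's built-in `hash` by default, which is not guaranteed to be stable across interpreter sessions.

```python
import hashlib

def stable_hash_index(word, n):
    """Map a word to an integer in [1, n-1] using MD5 (illustrative choice)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def hash_encode(text, n):
    """Lowercase the text, split on whitespace, and hash each word to an index."""
    return [stable_hash_index(w, n) for w in text.lower().split()]

encoded = hash_encode("What To Watch On Hulu That s New This Week", 10000)
print(encoded)  # ten integers, one per word

# With a tiny index range, collisions are unavoidable (pigeonhole principle).
small = hash_encode("a b c d e f", 4)  # only indices 1..3 are available
print(len(set(small)) < len(small))  # True
```

A collision means two different words end up sharing one embedding vector, which is the price of skipping an explicit vocabulary.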


Compute the length of the longest encoded headline in ohe_sent and pad the shorter ones with zeros so that all sequences have the same length.

In [21]:
max_len=0
for seq in ohe_sent: #Each element is one encoded headline, not a single word
    max_len=max(max_len,len(seq))

In [22]:
max_len

23

In [23]:
pad_sent=pad_sequences(ohe_sent,maxlen=max_len) #Zeros are prepended by default (padding='pre')

In [25]:
print(pad_sent[0])
print(pad_sent[-1])

[   0    0    0    0    0    0    0    0    0    0    0 1268 1528 4101
   88 4144 7287 9607 7211  625 3392  797 8933]
[   0    0    0    0    0    0    0    0    0  675 4873 8559   48 8933
 1768 6889 9585 2186  675 1051 8590 2991 7955]


Now every encoded headline is a sequence of 23 integer indices, left-padded with zeros.
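To make the padding step concrete, here is a minimal pure-Python sketch of what left-padding does. The helper name `pad_pre` and the toy sequences are made up for illustration; it mimics the default behaviour of `pad_sequences` (zeros prepended, long sequences cut from the front) rather than reproducing the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length.

    Hypothetical helper for illustration; assumes maxlen > 0.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy = [[5, 7], [1, 2, 3, 4], [9]]
longest = max(len(s) for s in toy)  # same idea as max_len in the notebook
print(pad_pre(toy, longest))
# [[0, 0, 5, 7], [1, 2, 3, 4], [0, 0, 0, 9]]
```

Because index 0 is reserved for padding here, it helps that the hashed word indices start at 1.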

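Define a model consisting of a single Embedding layer that maps each of the 10,000 indices to a 50-dimensional vector. The model is compiled with the RMSProp optimizer and mean squared error as the loss, but it is not trained in this notebook.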

In [26]:
model=Sequential()
model.add(Embedding(vocab_size,50,input_length=max_len))
model.compile('rmsprop','mse')

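The embedding layer now holds one 50-dimensional vector per index. Because fit is never called, these vectors are still the layer's random initial values. We can look up the vectors for the headlines and words in pad_sent using model.predict.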

In [27]:
model.predict(pad_sent)[0]

array([[ 0.04391349,  0.0453858 , -0.02613759, ...,  0.0171277 ,
         0.02062822, -0.03409203],
       [ 0.04391349,  0.0453858 , -0.02613759, ...,  0.0171277 ,
         0.02062822, -0.03409203],
       [ 0.04391349,  0.0453858 , -0.02613759, ...,  0.0171277 ,
         0.02062822, -0.03409203],
       ...,
       [ 0.03721166,  0.00751374, -0.00661721, ...,  0.02370478,
        -0.04891087,  0.04572893],
       [-0.00341044, -0.03072641, -0.01690539, ...,  0.01867681,
         0.00623417, -0.01163029],
       [-0.04781581,  0.03976731, -0.00262574, ...,  0.04811199,
        -0.0304492 ,  0.03536019]], dtype=float32)
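An embedding layer behaves like a lookup table: each integer index selects one row of a weight matrix. The pure-Python sketch below illustrates this with toy sizes; the names `table` and `embed` are made up, and the uniform range of plus or minus 0.05 is an assumption chosen to resemble the magnitude of the values printed above. It also shows why the leading rows of the output are identical: they all correspond to the padding index 0.

```python
import random

random.seed(0)
vocab_size, dim = 10, 4  # toy sizes; the notebook uses 10000 and 50

# One row per index, filled with small uniform random values (assumed range)
table = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
         for _ in range(vocab_size)]

def embed(sequence, table):
    """Return one vector per index by looking up rows in the table."""
    return [table[i] for i in sequence]

padded = [0, 0, 3, 7, 3]  # a toy left-padded sequence
vectors = embed(padded, table)

print(vectors[0] == vectors[1])  # True: both positions hold the padding index 0
print(vectors[2] == vectors[4])  # True: the same index always yields the same row
```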

In [28]:
#Word embedding for the last word of the first headline (sent[0]), which is 'TV':

model.predict(pad_sent)[0][-1]

array([-4.7815811e-02,  3.9767314e-02, -2.6257411e-03,  2.6954006e-02,
       -4.3435849e-02,  4.3699030e-02, -3.8771033e-03,  6.6916831e-03,
        3.7492845e-02, -2.8893685e-02,  2.2647452e-02, -3.9983392e-03,
        1.0572374e-05,  2.6472211e-03,  2.6500177e-02, -1.6173888e-02,
        3.2573152e-02, -2.9897070e-02, -5.8228858e-03, -4.9736954e-02,
        3.1751577e-02, -6.8246610e-03, -8.7342747e-03,  1.8186513e-02,
        3.4542050e-02, -3.4797192e-02,  2.5794458e-02, -3.5924695e-02,
        2.1645810e-02, -7.6074488e-03,  3.6285270e-02, -3.3404946e-02,
        9.6990354e-03, -3.2703713e-02, -4.5130204e-02, -4.2662241e-02,
        1.4409233e-02, -3.2845870e-02,  4.5430247e-02, -5.8729537e-03,
        6.4907447e-03,  2.4783406e-02,  4.0559355e-02,  2.9809747e-02,
        4.7572479e-03,  1.4624227e-02,  3.6044605e-03,  4.8111986e-02,
       -3.0449200e-02,  3.5360191e-02], dtype=float32)

In [29]:
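#Word embedding for the last word of the third headline (sent[2]); the headlines ending in 'Week' shown earlier are sent[7] and sent[9]: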

model.predict(pad_sent)[2][-1]

array([ 0.03197933,  0.00617479,  0.01134831,  0.01664588,  0.04026793,
        0.0468074 , -0.04918547, -0.02131295, -0.04805043,  0.04024074,
        0.04140475, -0.01917273,  0.03136387, -0.00160036,  0.03116623,
       -0.01072866, -0.00057683, -0.04159043, -0.03906962,  0.01633375,
       -0.04855192, -0.02713817, -0.03227453,  0.01677359,  0.01343079,
        0.02372496, -0.02880554,  0.00980498,  0.01453947,  0.04585406,
       -0.03103371, -0.02340667, -0.00404655, -0.01873124,  0.00546547,
        0.03536229,  0.04703805,  0.00828092, -0.03963355,  0.01348083,
        0.0440304 ,  0.04419577,  0.00697362,  0.00218406,  0.01864007,
        0.00480332, -0.00638233,  0.01960769,  0.00129684, -0.0199906 ],
      dtype=float32)

These embeddings can be used as input features for downstream models, typically after the Embedding layer has been trained as part of such a model.
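One simple way to turn per-word vectors into a fixed-size feature vector for a headline is to average them while skipping the padding index. This is a minimal sketch of that idea, not something the notebook itself does; `mean_pool` and `toy_table` are hypothetical names, and the tiny table stands in for the 10,000 x 50 embedding matrix.

```python
def mean_pool(sequence, table, pad_index=0):
    """Average the vectors of all non-padding indices in a sequence."""
    rows = [table[i] for i in sequence if i != pad_index]
    if not rows:
        # A sequence made only of padding gets an all-zero feature vector
        return [0.0] * len(table[0])
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

toy_table = [
    [0.0, 0.0],  # index 0: padding
    [1.0, 2.0],  # index 1
    [3.0, 4.0],  # index 2
]
print(mean_pool([0, 0, 1, 2], toy_table))  # [2.0, 3.0]
```

The resulting vector has the same length for every headline, so it can be passed to any classifier or regressor that expects fixed-size inputs.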