### Word Embedding
Word embedding is a class of approaches for representing words and documents as dense vectors.

Traditional bag-of-words representations are sparse matrices and consume a lot of memory. Word embeddings are used to overcome this. In a word embedding, each word is represented by a dense vector, which is the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.
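
To make the contrast with bag of words concrete, here is a minimal sketch in plain Python. The toy vocabulary and the 3-dimensional vectors are made-up values chosen for illustration; they are not learned embeddings.

```python
import math

# Hypothetical toy vocabulary; a real one has thousands of entries.
vocab = ["good", "great", "bad", "movie", "plot"]

def one_hot_vector(word):
    """Sparse representation: one slot per vocabulary word."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Made-up dense vectors standing in for learned embeddings.
toy_embeddings = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "bad":   [-0.7, 0.1, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Sparse vectors of two different words never overlap.
print(cosine(one_hot_vector("good"), one_hot_vector("great")))  # 0.0

# Dense vectors can place related words close together.
print(cosine(toy_embeddings["good"], toy_embeddings["great"]))
print(cosine(toy_embeddings["good"], toy_embeddings["bad"]))
```

The sparse vectors grow with the vocabulary and carry no notion of similarity, while the dense vectors keep a fixed, small length.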

Two popular examples of methods of learning word embeddings from text include:
1) GloVe

2) Word2Vec

The Keras `Embedding` layer is used as the first hidden layer of a network. It is typically given three arguments:

**input_dim**: Size of the vocabulary in the text data. (For example, if your data is integer encoded to values between 0 and 10, the size of the vocabulary would be 11 words.)

**output_dim**: Size of the vector space in which words are embedded. (It defines the size of the output vector this layer produces for each word.)

**input_length**: Length of the input sequence. (For example, if all of your input documents consist of 1000 words, this would be 1000.)
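
Conceptually, the three arguments describe a lookup table and the shape of what comes out of it. The following plain-Python sketch mimics that behaviour; it is not how Keras implements the layer, and the sizes and the helper name `embed` are hypothetical choices for illustration.

```python
import random

random.seed(0)

input_dim = 11     # vocabulary size: integer codes 0..10
output_dim = 4     # length of each word vector (hypothetical choice)
input_length = 5   # tokens per input sequence (hypothetical choice)

# The layer's weights form a table with one row per vocabulary index.
weights = [[random.uniform(-0.05, 0.05) for _ in range(output_dim)]
           for _ in range(input_dim)]

def embed(sequence):
    """Replace each integer code with its row of the weight table."""
    return [weights[token] for token in sequence]

sequence = [3, 10, 0, 7, 3]   # one integer-encoded document
embedded = embed(sequence)

print(len(embedded), len(embedded[0]))   # 5 4  (input_length, output_dim)
print(input_dim * output_dim)            # 44 trainable weights
```

Both occurrences of code `3` map to the same row, which is why the position of a word in the vector space does not depend on where it appears in the sequence.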

In [None]:
#https://stackabuse.com/python-for-nlp-word-embeddings-for-deep-learning-in-keras/
#https://towardsdatascience.com/deep-learning-4-embedding-layers-f9a02d55ac12
#https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
#https://medium.com/logivan/neural-network-embedding-and-dense-layers-whats-the-difference-fa177c6d0304

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

In [9]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

In [3]:
#Change working directory
import os
os.chdir('C:/Users/SJ/Desktop/NLP/Sentiment labeled datasets/sentiment labelled sentences')

In [4]:
#load data
imdb_data=pd.read_csv('imdb_labelled.txt',delimiter='\t',header=None)
amazon_data=pd.read_csv('amazon_cells_labelled.txt',delimiter='\t',header=None)
yelp_data=pd.read_csv('yelp_labelled.txt',delimiter='\t',header=None)

In [5]:
#combine all data
data=pd.concat([imdb_data,amazon_data,yelp_data])
data.columns=['Reviews','Reviews_class']
data.head(2)

Unnamed: 0,Reviews,Reviews_class
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0


In [6]:
#Check shape
data.shape

(2748, 2)

In [10]:
docs = data['Reviews'].values.tolist()
docs

['A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ',
 'Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.  ',
 'Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.  ',
 'Very little music or anything to speak of.  ',
 'The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.  ',
 "The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.  ",
 'Wasted two hours.  ',
 'Saw the movie today and thought it was a good effort, good messages for kids.  ',
 'A bit predictable.  ',
 'Loved the casting of Jimmy Buffet as the science teacher.  ',
 'And those baby owls were adorable.  ',
 "The movie showed a lot of Florida at it's best, made it look very appealing.  ",
 'The Son

In [11]:
# integer encode the documents (one_hot hashes words, so distinct words may share an integer)
vocab_size = 1000
encoded_docs = [one_hot(d, vocab_size, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True) for d in docs]
print(encoded_docs)

[[505, 798, 798, 798, 947, 816, 518, 272, 447, 505, 27, 707, 419, 874], [66, 393, 466, 329, 363, 134, 515, 550, 638, 904, 515, 18, 62, 411, 856, 899, 285, 264], [855, 969, 549, 425, 254, 222, 766, 540, 217, 515, 272, 851, 152, 576, 363, 798, 154, 515, 818, 329, 850, 222, 515, 503, 222, 710, 424, 578, 160], [798, 379, 591, 904, 56, 636, 896, 856], [515, 390, 446, 717, 515, 272, 329, 599, 482, 734, 346, 636, 996, 505, 749, 123, 459, 843, 11, 252, 997], [515, 46, 856, 515, 272, 61, 238, 982, 339, 298, 761, 447, 216, 221, 288, 456, 775, 604, 761, 626], [285, 512, 917], [385, 515, 272, 224, 222, 578, 221, 329, 505, 879, 402, 879, 428, 634, 635], [505, 175, 343], [680, 515, 579, 856, 8, 828, 154, 515, 299, 302], [222, 193, 653, 644, 365, 382], [515, 272, 324, 505, 133, 856, 320, 776, 761, 390, 68, 221, 809, 798, 528], [515, 227, 365, 515, 390, 222, 515, 547, 365, 360, 293], [221, 329, 360, 213], [768, 734, 505, 798, 822, 144, 90, 272, 123, 432, 964, 424, 822, 717, 583, 678], [221, 244, 807, 
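
Despite its name, `one_hot` does not build a vocabulary or return one-hot vectors. It hashes each word into the range 1 to `vocab_size - 1`, leaving 0 free, so two different words can receive the same integer. The sketch below imitates that idea in plain Python. The helper `hash_encode`, the regular expression used for tokenising, and the use of `hashlib.md5` (picked so results are repeatable) are assumptions for illustration, and the integers it prints will not match the output above.

```python
import hashlib
import re

def hash_encode(text, n):
    """Map each word to an integer in 1..n-1 with a stable hash (0 stays free)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    codes = []
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        codes.append(int(digest, 16) % (n - 1) + 1)
    return codes

codes = hash_encode("A very, very, very slow-moving, aimless movie", 1000)
print(codes)

# The same word always gets the same code ...
print(codes[1] == codes[2] == codes[3])   # True

# ... but with few buckets, distinct words are forced to collide.
many_words = ["word%d" % i for i in range(50)]
small = [hash_encode(w, 10)[0] for w in many_words]
print(len(set(small)) < len(many_words))  # True: 50 words, at most 9 codes
```

A larger `vocab_size` reduces collisions at the cost of a larger embedding table.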

In [15]:
#Sequences have different lengths, but Keras expects every input to have the same length, so pad them to max_length.
max_length=100
padded_sequences=pad_sequences(encoded_docs,maxlen=max_length,padding='post')
print(padded_sequences)

[[505 798 798 ...   0   0   0]
 [ 66 393 466 ...   0   0   0]
 [855 969 549 ...   0   0   0]
 ...
 [466 456 329 ...   0   0   0]
 [515 720 744 ...   0   0   0]
 [378 154 298 ...   0   0   0]]
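
What the padding step does can be sketched in a few lines of plain Python. The helper `pad_post` is hypothetical; it assumes `padding='post'` as in the call above and the default `truncating='pre'`, under which sequences longer than `maxlen` lose tokens from the front.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; sequences longer than maxlen lose their first tokens."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]   # keep the last maxlen tokens
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

toy = [[285, 512, 917], [505, 175, 343, 66, 393, 466]]
print(pad_post(toy, maxlen=5))
# [[285, 512, 917, 0, 0], [175, 343, 66, 393, 466]]
```

Because the encoder never assigns 0 to a word, the padding value cannot be confused with a real token.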


In [18]:
labels=data['Reviews_class'].values

In [17]:
#Define model
model=Sequential()
model.add(Embedding(input_dim=vocab_size+1,output_dim=64,input_length=max_length))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))
#Compile the model
model.compile(loss='binary_crossentropy',metrics=['accuracy'],optimizer='adam')

#print model summary
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 64)           64064     
_________________________________________________________________
flatten (Flatten)            (None, 6400)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 6401      
Total params: 70,465
Trainable params: 70,465
Non-trainable params: 0
_________________________________________________________________
None
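
The parameter counts in the summary can be checked by hand. The short calculation below reuses the values from the cells above (`vocab_size`, `max_length`, and `output_dim=64`).

```python
vocab_size = 1000
max_length = 100
output_dim = 64

embedding_params = (vocab_size + 1) * output_dim   # one 64-d vector per index
flatten_units = max_length * output_dim            # 100 x 64 values unrolled
dense_params = flatten_units * 1 + 1               # weights plus one bias

print(embedding_params)                 # 64064
print(flatten_units)                    # 6400
print(dense_params)                     # 6401
print(embedding_params + dense_params)  # 70465
```

Most of the model's parameters sit in the embedding table, so `vocab_size` and `output_dim` are the main levers on model size.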


In [23]:
from sklearn.model_selection import train_test_split
xtrain,xtest,ytrain,ytest=train_test_split(padded_sequences,labels,random_state=20,test_size=0.2)
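
As a sanity check on the split, the expected sizes can be worked out by hand. This sketch assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of samples and that Keras uses its default batch size of 32 when none is given.

```python
import math

n_samples = 2748   # rows in the combined data (from data.shape)
test_size = 0.2
batch_size = 32    # assumed Keras default for fit/evaluate

n_test = math.ceil(test_size * n_samples)   # assumed rounding rule: round up
n_train = n_samples - n_test

print(n_train, n_test)                  # 2198 550
print(math.ceil(n_test / batch_size))   # 18 batches over the test set
```

The 18 batches agree with the `18/18` progress counter in the evaluation output further down.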

In [25]:
model.fit(xtrain,ytrain,epochs=10,validation_data=(xtest,ytest))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x16d984a8f10>

In [27]:
loss,accuracy=model.evaluate(xtest,ytest,verbose=2)
print('Accuracy: %f' % (accuracy*100))

18/18 - 0s - loss: 0.0492 - accuracy: 0.9800
Accuracy: 98.000002


In [None]:
#references
#https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
#https://medium.com/logivan/neural-network-embedding-and-dense-layers-whats-the-difference-fa177c6d0304