# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step1: Load the given dataset  

1. Mount the google drive

2. Import Glove embeddings

3. Import the test and train datasets

### Mount the google drive to access required project files

Run the below commands

In [0]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


#### Path for Project files on google drive

**Note:** You need to change this path according where you have kept the files in google drive. 

In [0]:
project_path = "/content/drive/My Drive/Fake News Challenge/"

### Loading the Glove Embeddings
The smallest package of embeddings is 822Mb, called “glove.6B.zip“. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.

"glove.6B.zip" is already provided to us; we can seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.

[Relevant article](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

In [0]:
from zipfile import ZipFile
with ZipFile(project_path + 'glove.6B.zip', 'r') as z:
  z.extractall()

# Load the dataset [5 Marks]

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas load the given train datasets files **`train_bodies.csv`** and **`train_stances.csv`**

2. Using [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas merge the two datasets based on the Body ID. 

Note: Save the final merged dataset in a dataframe with name **`dataset`**.

In [0]:
import pandas as pd
import os
train_bodies = os.path.join(project_path, 'train_bodies.csv')
train_stances = os.path.join(project_path, 'train_stances.csv')
df_tb = pd.read_csv(train_bodies)
df_ts = pd.read_csv(train_stances)
print(df_tb.head())
print(df_ts.head())

   Body ID                                        articleBody
0        0  A small meteorite crashed into a wooded area i...
1        4  Last week we hinted at what was to come as Ebo...
2        5  (NEWSER) – Wonder how long a Quarter Pounder w...
3        6  Posting photos of a gun-toting child online, I...
4        7  At least 25 suspected Boko Haram insurgents we...
                                            Headline  Body ID     Stance
0  Police find mass graves with at least '15 bodi...      712  unrelated
1  Hundreds of Palestinians flee floods in Gaza a...      158      agree
2  Christian Bale passes on role of Steve Jobs, a...      137  unrelated
3  HBO and Apple in Talks for $15/Month Apple TV ...     1034  unrelated
4  Spider burrowed through tourist's stomach and ...     1923   disagree


In [0]:
df_tb.shape, df_ts.shape

((1683, 2), (49972, 3))

In [0]:
df_ts[df_ts['Body ID'] == 158].head()

Unnamed: 0,Headline,Body ID,Stance
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
3107,It's 'rubbish' that Robert Plant turned down £...,158,unrelated
6392,Robert Plant ripped up $800M Led Zeppelin reun...,158,unrelated
8059,ISIS Militant “Jihadi John” Identified As Youn...,158,unrelated
11688,Claim: Comcast Got Complaining Customer Fired ...,158,unrelated


In [0]:
dataset = pd.merge(df_tb, df_ts, on='Body ID')


<h2> Check1:</h2>
  
<h3> You should see the below output if you run `dataset.head()` command as given below </h3>

In [0]:
dataset.head()

Unnamed: 0,Body ID,articleBody,Headline,Stance
0,0,A small meteorite crashed into a wooded area i...,"Soldier shot, Parliament locked down after gun...",unrelated
1,0,A small meteorite crashed into a wooded area i...,Tourist dubbed ‘Spider Man’ after spider burro...,unrelated
2,0,A small meteorite crashed into a wooded area i...,Luke Somers 'killed in failed rescue attempt i...,unrelated
3,0,A small meteorite crashed into a wooded area i...,BREAKING: Soldier shot at War Memorial in Ottawa,unrelated
4,0,A small meteorite crashed into a wooded area i...,Giant 8ft 9in catfish weighing 19 stone caught...,unrelated


In [0]:
dataset.shape

(49972, 4)

## Step2: Data Pre-processing and setting some hyper parameters needed for model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum no.of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum no.of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum no.of words in the total vocabualry.

4. `MAX_SENTS_HEADING` = Maximum no.of sentences to consider in a heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download the `Punkt` from nltk using the commands given below. This is for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [0]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Tokenizing the text and loading the pre-trained Glove word embeddings for each token  [5 marks] 

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [0]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


#### Initialize the Tokenizer class with maximum vocabulary count as `MAX_NB_WORDS` initialized at the start of step2. 

In [0]:
toknzr = Tokenizer(num_words=MAX_NB_WORDS)

#### Now, using fit_on_texts() from Tokenizer class, lets encode the data 

Note: We need to fit articleBody and Headline also to cover all the words.

In [0]:
txt = dataset['Headline'].append(dataset['articleBody'])
print("Total documents = " , len(txt))
toknzr.fit_on_texts(txt.values)

Total documents =  99944


In [0]:
list(toknzr.word_counts.items())[:5]

[('soldier', 3582),
 ('shot', 10784),
 ('parliament', 6992),
 ('locked', 211),
 ('down', 11448)]

In [0]:
list(toknzr.word_docs.items())[:5]

[('gunfire', 1560),
 ('after', 28779),
 ('war', 6241),
 ('shot', 5780),
 ('memorial', 1782)]

In [0]:
list(toknzr.word_index.items())[0:5]

[('the', 1), ('to', 2), ('a', 3), ('of', 4), ('in', 5)]

In [0]:
toknzr.document_count

99944

#### fit_on_texts() gives the following attributes in the output as given [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared on during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared on during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.



### Now, tokenize the sentences using nltk sent_tokenize() and encode the senteces with the ids we got form the above `t.word_index`

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [0]:
from nltk import sent_tokenize
texts = dataset['articleBody'].values
articles = [sent_tokenize(t) for t in texts]

## Check 2:

first element of texts and articles should be as given below. 

In [0]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [0]:
articles[0]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

In [0]:
len(articles)

49972

# Now iterate through each article and each sentence to encode the words into ids using t.word_index  [5 marks] 

Here, to get words from sentence you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable of shape (no.of articles, MAX_SENTS, MAX_SENT_LENGTH) with name `data` with zeros first (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) to initialize with all zeros)and then update it while iterating through the words and sentences in each article.

In [0]:
import numpy as np
from keras.preprocessing.text import text_to_word_sequence
NUM_ARTICLES = len(articles)
data = np.zeros(shape=(NUM_ARTICLES, MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

article_counter = 0
for article in articles:
  sent_counter = 0
  #print("Sentences in article = ", len(article))
  if (len(article) > MAX_SENTS):
    article = article[0:MAX_SENTS]
  for sent in article:
      vals = np.array([toknzr.word_index[word] for word in text_to_word_sequence(sent)])
      if vals.shape[0] > MAX_SENT_LENGTH:
        vals = vals[0:MAX_SENT_LENGTH]
      data[article_counter, sent_counter, 0:vals.shape[0]] = vals
      sent_counter += 1
  article_counter += 1
print('Done')

Done


In [0]:
data.shape

(49972, 20, 20)

### Check 3:

Accessing first element in data should give something like given below.

In [0]:
data[0, :, :]

array([[    3,   481,   427,  7211,    81,     3,  3734,   331,     5,
         3892,   350,     4,  1431,  2960,     1,    89,    12,   466,
            0,     0],
       [  758,    95,  1047,     3,  2679,  1752,     7,   189,     3,
         1217,  1075,  2030,   700,   159,     1,  3033,   448,     1,
          555,   235],
       [   89,  1068,  4117,  2349,    12,     3,  1092,  3307,    19,
            1,    89,     2,  1793,     1,   521,  2009,    15,     9,
            3,  3111],
       [  181,  3641,   972,   200,  2558,    44,  6776,  1722,  1252,
            5, 13324, 17943,     1,   778,    31,   740,  3991,    67,
           85,     0],
       [ 2349,    12,  1557,    38,  1094,   351,   775,     2,   367,
          260,  1770,     5,  4455,    70,   494,     0,     0,     0,
            0,     0],
       [    1,   700,   189,    19,     1,   427,    32,     3,  7423,
            4,  2159,  1252,     6,     3,  5271,     4,  1217,  1252,
           12,  3365],
       [  

# Repeat the same process for the `Headings` as well. Use variables with names `texts_heading` and `articles_heading` accordingly. [5 marks] 

In [0]:
texts_heading = dataset['Headline'].values
articles_heading = [sent_tokenize(t) for t in texts_heading]

NUM_HEADLINES = len(articles_heading)
data_heading = np.zeros(shape=(NUM_HEADLINES, MAX_SENTS, MAX_SENT_LENGTH), dtype='int32')

article_counter = 0
for article in articles_heading:
  sent_counter = 0
  #print("Sentences in article = ", len(article))
  if (len(article) > MAX_SENTS):
    article = article[0:MAX_SENTS]
  for sent in article:
      vals = np.array([toknzr.word_index[word] for word in text_to_word_sequence(sent)])
      if vals.shape[0] > MAX_SENT_LENGTH:
        vals = vals[0:MAX_SENT_LENGTH]
      data_heading[article_counter, sent_counter, 0:vals.shape[0]] = vals
      sent_counter += 1
  article_counter += 1
print('Done')

Done


In [0]:
data_heading[0,:,:]

array([[  717,   206,   343,  7118,   193,    34,  1338, 11495,    21,
          233,   686,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [    0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0],
       [  

### Now the features are ready, lets make the labels ready for the model to process.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.

In [0]:
labels = pd.get_dummies(dataset['Stance'])
labels.shape

(49972, 4)

### Check 4:

The shape of data and labels shoould match the given below numbers.

In [0]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (49972, 20, 20)
Shape of label tensor: (49972, 4)


### Shuffle the data

In [0]:
## get numbers upto no.of articles
indices = np.arange(data.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [0]:
labels.iloc[[0, 4, 5]]

Unnamed: 0,agree,disagree,discuss,unrelated
0,0,0,0,1
4,0,0,0,1
5,0,0,0,1


In [0]:
## shuffle the data
data = data[indices]
data_heading = data_heading[indices]
## shuffle the labels according to data
labels = labels.iloc[indices,:]

In [0]:
labels[0:5]

Unnamed: 0,agree,disagree,discuss,unrelated
30800,0,0,0,1
19991,0,0,0,1
16674,0,0,0,1
47197,0,0,0,1
16301,0,0,0,1


### Split into train and validation sets. Split the train set 80:20 ratio to get the train and validation sets.


Use the variable names as given below:

x_train, x_val - for body of articles.

x-heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.



In [0]:
idx_80_perc = round(len(data) * 80 / 100)
x_train = data[0:idx_80_perc]
x_val = data[idx_80_perc:]

y_train = labels[0:idx_80_perc]
y_val = labels[idx_80_perc:]

idx_80_perc = round(len(data_heading) * 80 / 100)
x_heading_train = data[0:idx_80_perc]
x_heading_val = data[idx_80_perc:] 

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

print(x_heading_train.shape)
print(x_heading_val.shape)

(39978, 20, 20)
(39978, 4)
(9994, 20, 20)
(9994, 4)
(39978, 20, 20)
(9994, 20, 20)


In [0]:
# Earlier tried to initialize vocab_size with num_words, but vocab_size is not really controlled by toknzr.num_word
# vocab_size = toknzr.num_words
# Hence changed vocab_size to below
vocab_size = len(toknzr.word_index.items())

### Create embedding matrix with the glove embeddings


Run the below code to create embedding_matrix which has all the words and their glove embedding if present in glove word list.

In [0]:
# load the whole embedding into memory
embeddings_index = dict()
f = open('./glove.6B.100d.txt')
for line in f:
	values = line.split()
	word = values[0]
	coefs = np.asarray(values[1:], dtype='float32')
	embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))

embedding_vector_missing = []
for word, i in toknzr.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector
	else:
		embedding_vector_missing.append(word)
	
print(f'Embedding Vector missing for total {len(embedding_vector_missing)} words.')

Loaded 400000 word vectors.
Embedding Vector missing for total 6781 words.


# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

In [0]:
x = x_train
x_train = np.reshape(x, (x.shape[0], (x.shape[1] * x.shape[2])))
x = x_val
x_val = np.reshape(x, (x.shape[0], (x.shape[1] * x.shape[2])))

print(x_train.shape)
print(x_val.shape)

(39978, 400)
(9994, 400)


In [0]:
x = x_heading_train
x_heading_train = np.reshape(x, (x.shape[0], (x.shape[1] * x.shape[2])))

x = x_heading_val
x_heading_val = np.reshape(x, (x.shape[0], (x.shape[1] * x.shape[2])))


In [0]:
print(x_heading_train.shape)
print(x_heading_val.shape)

(39978, 400)
(9994, 400)


In [0]:
x_t = np.hstack((x_train, x_heading_train))
x_v = np.hstack((x_val, x_heading_val))

In [0]:
print(x_t.shape)
print(x_v.shape)

(39978, 800)
(9994, 800)


In [0]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.callbacks import ReduceLROnPlateau

### Model using Simple LSTM

In [0]:
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=x_t.shape[1], trainable=False)
model.add(e)
model.add(LSTM(100, dropout=0.1, recurrent_dropout=0.1))
#model.add(Flatten())
model.add(Dense(4, activation='softmax'))
# compile the model
lr_reduce = ReduceLROnPlateau(monitor='val_acc', factor=0.1, epsilon=1e-5, patience=10, verbose=1)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

# fit the model
model.fit(x_t, y_train, validation_data=(x_v, y_val), epochs=50, verbose=1, callbacks=[lr_reduce], batch_size=1024)
# evaluate the model
loss, accuracy = model.evaluate(x_v, y_val, verbose=0)
print('Accuracy: %f' % (accuracy*100))











Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_1 (Dense)              (None, 4)                 404       
Total params: 2,868,104
Trainable params: 80,804
Non-trainable params: 2,787,300
_________________________________________________________________
None
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where






Train on 39978 samples, validate on 9994 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50
Accuracy: 71.492896


### Model using Bidirectional LSTM

In [45]:
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=x_t.shape[1], trainable=False)
model.add(e)
model.add(Bidirectional(LSTM(128, dropout=0.1, recurrent_dropout=0.1)))
model.add(Dense(4, activation='softmax'))
# compile the model
lr_reduce = ReduceLROnPlateau(monitor='val_acc', factor=0.1, epsilon=1e-5, patience=10, verbose=1)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

# fit the model
model.fit(x_t, y_train, validation_data=(x_v, y_val), epochs=50, verbose=1, callbacks=[lr_reduce], batch_size=1024)
# evaluate the model
loss, accuracy = model.evaluate(x_v, y_val, verbose=0)
print('Accuracy: %f' % (accuracy*100))



Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 800, 100)          2787300   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 256)               234496    
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 1028      
Total params: 3,022,824
Trainable params: 235,524
Non-trainable params: 2,787,300
_________________________________________________________________
None
Train on 39978 samples, validate on 9994 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50

    So it appears that Bidirectional LSTM gives way better results than Simple LSTM

## Build the same model with attention layers included for better performance (Optional)

## Fit the model and report the accuracy score for the model with attention layer (Optional)

# Extra - Understanding Keras Embedding Layer

## How Keras Embedding Layer can be used to create word embedding. 
A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over more the traditional bag-of-word model encoding schemes where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.

Two popular examples of methods of learning word embeddings from text include:

Word2Vec.
GloVe.
In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.

[Ref](https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/)

In [46]:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

[[19, 19], [23, 43], [46, 16], [21, 43], [33], [6], [19, 16], [22, 23], [19, 43], [8, 46, 19, 45]]
[[19 19  0  0]
 [23 43  0  0]
 [46 16  0  0]
 [21 43  0  0]
 [33  0  0  0]
 [ 6  0  0  0]
 [19 16  0  0]
 [22 23  0  0]
 [19 43  0  0]
 [ 8 46 19 45]]
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 89.999998
