# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step 1: Load the given dataset

1. Mount Google Drive

2. Import GloVe embeddings

3. Import the test and train datasets

### Mount Google Drive to access the required project files

Run the commands below.

In [0]:
from google.colab import drive


In [2]:
drive.mount('/content/drive/')

Mounted at /content/drive/


#### Path for project files on Google Drive

**Note:** Change this path to match where you have stored the files in Google Drive.

In [0]:
project_path = "/content/drive/My Drive/Colab Notebooks/Advanced NLP/fake news detection/"

In [0]:
pwd

'/content'

### Loading the GloVe embeddings

In [4]:
print("Parsing the data into the required format takes time, so only 1000 rows are used")

Parsing the data into the required format takes time, so only 1000 rows are used


In [0]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()

# Load the dataset [5 Marks]

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas, load the given training dataset files **`train_bodies.csv`** and **`train_stances.csv`**.

2. Using the [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas, merge the two datasets on the `Body ID` column.

Note: Save the final merged dataset in a dataframe named **`dataset`**.

In [0]:
import pandas as pd
times_now_bodies = pd.read_csv(project_path+'train_bodies.csv')
times_now_stances = pd.read_csv(project_path+'train_stances.csv')

In [7]:
times_now_bodies.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [8]:
times_now_stances.head()

Unnamed: 0,Headline,Body ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [0]:
dataset = pd.merge(times_now_stances, times_now_bodies, on ='Body ID')
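
To see what the merge does, here is a minimal sketch on made-up rows (the toy frames `bodies` and `stances` are illustrative and not taken from the dataset). Each stance row picks up the `articleBody` whose `Body ID` matches, so one body can appear next to many headlines. The `validate` argument is an optional safety check, assumed useful here, that raises an error if a `Body ID` is duplicated on the bodies side.

```python
import pandas as pd

# Toy stand-ins for train_bodies.csv and train_stances.csv (made-up rows)
bodies = pd.DataFrame({
    "Body ID": [0, 4],
    "articleBody": ["body text zero", "body text four"],
})
stances = pd.DataFrame({
    "Headline": ["headline a", "headline b", "headline c"],
    "Body ID": [4, 0, 4],
    "Stance": ["agree", "unrelated", "discuss"],
})

# Inner join on the shared key; many headlines may map to one body
merged = pd.merge(stances, bodies, on="Body ID", how="inner",
                  validate="many_to_one")

print(merged.shape)                      # one row per headline, four columns
print(merged[["Headline", "articleBody"]])
```

This also explains why the first rows of `dataset.head()` can share a single `Body ID`: several headlines in the stances file point at the same article body.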


<h2> Check1:</h2>
  
<h3> You should see the output below when you run `dataset.head()`. </h3>

In [10]:
dataset.head()

Unnamed: 0,Headline,Body ID,Stance,articleBody
0,Police find mass graves with at least '15 bodi...,712,unrelated,Danny Boyle is directing the untitled film\n\n...
1,Seth Rogen to Play Apple’s Steve Wozniak,712,discuss,Danny Boyle is directing the untitled film\n\n...
2,Mexico police find mass grave near site 43 stu...,712,unrelated,Danny Boyle is directing the untitled film\n\n...
3,Mexico Says Missing Students Not Found In Firs...,712,unrelated,Danny Boyle is directing the untitled film\n\n...
4,New iOS 8 bug can delete all of your iCloud do...,712,unrelated,Danny Boyle is directing the untitled film\n\n...


In [0]:
dataset.sort_values(["Body ID"], axis=0, 
                 ascending=True, inplace=True) 
dataset = dataset[:1000]

In [0]:
dataset.head()

Unnamed: 0,Headline,Body ID,Stance,articleBody
41651,"Soldier shot, Parliament locked down after gun...",0,unrelated,A small meteorite crashed into a wooded area i...
41657,Italian catches huge wels catfish; is it a rec...,0,unrelated,A small meteorite crashed into a wooded area i...
41658,Not coming to a store near you: The pumpkin sp...,0,unrelated,A small meteorite crashed into a wooded area i...
41659,One gunman killed in shooting on Parliament Hi...,0,unrelated,A small meteorite crashed into a wooded area i...
41660,Soldier shot at war memorial in Canada,0,unrelated,A small meteorite crashed into a wooded area i...


## Step 2: Data pre-processing and setting the hyperparameters needed for the model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum number of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum number of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum number of words in the total vocabulary.

4. `MAX_SENTS_HEADING` = Maximum number of sentences to consider in the heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download `punkt` from NLTK using the commands given below. It is used for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [13]:
import nltk
nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenizing the text and loading the pre-trained Glove word embeddings for each token  [5 marks] 

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [0]:
from tensorflow.keras.preprocessing.text import Tokenizer

#### Initialize the Tokenizer class with maximum vocabulary count as `MAX_NB_WORDS` initialized at the start of step2. 

In [0]:

t = Tokenizer(num_words=MAX_NB_WORDS)


In [47]:
dataset.columns

Index(['Headline', 'Body ID', 'Stance', 'articleBody'], dtype='object')

#### Now, using fit_on_texts() from the Tokenizer class, let's encode the data

Note: Both `articleBody` and `Headline` need to be fitted to cover all the words.

In [0]:
t.fit_on_texts(dataset['articleBody'])


In [18]:
t.word_index.items()

dict_items([('the', 1), ('to', 2), ('of', 3), ('a', 4), ('in', 5), ('and', 6), ('that', 7), ('is', 8), ('on', 9), ('said', 10), ('has', 11), ('for', 12), ('an', 13), ('was', 14), ('not', 15), ('he', 16), ('have', 17), ('at', 18), ('it', 19), ('with', 20), ('by', 21), ('”', 22), ('from', 23), ('be', 24), ('but', 25), ('his', 26), ('as', 27), ('been', 28), ('government', 29), ('this', 30), ('are', 31), ('will', 32), ('syria', 33), ('more', 34), ('they', 35), ('could', 36), ('were', 37), ('which', 38), ('state', 39), ('after', 40), ('would', 41), ('i', 42), ('military', 43), ('also', 44), ('s', 45), ('u', 46), ('had', 47), ('—', 48), ('new', 49), ('islamic', 50), ('she', 51), ('story', 52), ('iraq', 53), ('news', 54), ('who', 55), ('if', 56), ('time', 57), ('about', 58), ('amazon', 59), ('work', 60), ('we', 61), ('obama', 62), ('or', 63), ('one', 64), ('video', 65), ('some', 66), ('even', 67), ('years', 68), ('my', 69), ('media', 70), ('up', 71), ('sources', 72), ('while', 73), ('people',

In [19]:
t.word_counts.get('the')

26328

In [0]:
t.word_docs.get('nicaraguas')

In [0]:
dataset.shape

(1000, 4)

#### fit_on_texts() gives the following attributes in the output as given [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared on during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared on during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.
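
The four attributes are easier to tell apart on a tiny corpus. The sketch below recomputes them with plain Python rather than Keras, so it is a simplified stand-in (whitespace splitting, no punctuation filtering) and not the `Tokenizer` implementation itself; the two example sentences are made up.

```python
from collections import Counter

texts = ["the cat sat", "the dog sat on the mat"]   # made-up corpus

word_counts = Counter()   # total occurrences of each word
word_docs = Counter()     # number of texts that contain each word
for text in texts:
    words = text.lower().split()
    word_counts.update(words)
    word_docs.update(set(words))

# Rank words by frequency; ranks start at 1 because index 0 is kept for padding
word_index = {word: rank
              for rank, (word, _) in enumerate(word_counts.most_common(), start=1)}
document_count = len(texts)

print(word_counts["the"], word_docs["the"], word_index["the"], document_count)
```

Here "the" occurs three times in total but in only two texts, which is the difference between `word_counts` and `word_docs`.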



### Now, tokenize the sentences using nltk sent_tokenize() and encode the sentences with the ids we got from the above `t.word_index`

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [0]:
articles = []

In [0]:
for art in dataset['articleBody']:
  articles.append(art)

In [51]:
len(articles)

1000

In [52]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
# Raw strings keep the backslash escapes intact for the regex engine
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [0]:
def text_prepare(text):
    """Lowercase the text, strip symbols, and drop English stopwords."""
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(" ", text)
    text = BAD_SYMBOLS_RE.sub("", text)
    # Keep only the words that are not stopwords
    kept_words = [word for word in text.split(' ') if word not in STOPWORDS]
    text = ' '.join(kept_words)
    # Collapse runs of spaces left behind by the removed symbols
    text = re.sub(' +', ' ', text)
    return text

In [0]:
from tensorflow.keras.preprocessing import sequence



In [0]:
import numpy as np

In [56]:
nltk.sent_tokenize(articles[0])[0]#.split(" ")


"A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday."

In [57]:
nltk.sent_tokenize(articles[0])

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

In [58]:
print("processing this 49k below takes hrs, so reducing the data size")

processing this 49k below takes hrs, so reducing the data size


In [0]:

data_prep = []
for article in articles:
  sent_20 = []
  # Keep at most MAX_SENTS sentences per article
  for sentence in nltk.sent_tokenize(article)[:MAX_SENTS]:
    # Split the cleaned sentence into words and keep at most MAX_SENT_LENGTH of them
    words = text_prepare(sentence).split(" ")[:MAX_SENT_LENGTH]
    # Direct dictionary lookup; words missing from the vocabulary are skipped
    text_index = [t.word_index[w] for w in words if w in t.word_index]
    sent_20.append(text_index)

  # Pad short articles with all-zero sentences so every article has MAX_SENTS rows
  while len(sent_20) < MAX_SENTS:
    sent_20.append([0] * MAX_SENT_LENGTH)
  x = sequence.pad_sequences(sent_20, maxlen=MAX_SENT_LENGTH, padding='post')
  data_prep.append(x.tolist())

## pad_sequences works on a list of lists
## The model input has 3 dimensions: number of rows, time steps, and number of features
## If the inner lists differ in length, converting them to an array will not give a regular shape

In [0]:
data_prep_arr = np.array(data_prep)

In [61]:
print('Shape of data tensor:', data_prep_arr.shape)
#print('Shape of label tensor:', labels.shape)

Shape of data tensor: (1000, 20, 20)


In [62]:
data_prep[0:1]

[[[1245,
   393,
   2027,
   2063,
   521,
   793,
   2028,
   2065,
   29,
   10,
   143,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [441,
   123,
   1246,
   2066,
   1256,
   281,
   2067,
   701,
   299,
   1237,
   318,
   200,
   122,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [29,
   649,
   2029,
   1247,
   10,
   278,
   2069,
   29,
   694,
   2030,
   2031,
   2032,
   1245,
   393,
   1406,
   423,
   676,
   2024,
   274,
   391],
  [676,
   386,
   2033,
   2058,
   599,
   703,
   2071,
   2072,
   391,
   316,
   2034,
   54,
   122,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [1247,
   10,
   1210,
   1250,
   305,
   2035,
   1687,
   421,
   1978,
   2073,
   1494,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0,
   0],
  [701,
   281,
   393,
   2074,
   1002,
   703,
   2075,
   459,
   703,
   10,
   1257,
   2076,
   2077,
   2036,
   493,
   1258,
   691,
   278,
   0,
   0],
  [10, 107, 365, 393, 2078, 2079, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 



## Check 2:

The first element of `texts` and `articles` should be as given below.

In [0]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [0]:
articles[0]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

# Now iterate through each article and each sentence to encode the words into ids using t.word_index  [5 marks] 

Here, to get words from sentence you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable named `data` of shape (number of articles, MAX_SENTS, MAX_SENT_LENGTH) with zeros first (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) to initialize it with all zeros) and then update it while iterating through the words and sentences in each article.
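
The two steps above can be sketched on toy inputs as follows. Everything here is illustrative: the `articles` list, the `word_index` mapping, and the small values of `MAX_SENTS` and `MAX_SENT_LENGTH` are made up, and a simple regular expression stands in for `text_to_word_sequence`.

```python
import re
import numpy as np

MAX_SENTS = 3          # small values so the result is easy to read
MAX_SENT_LENGTH = 4

# Hypothetical toy inputs: two "articles" already split into sentences,
# and a word -> id mapping playing the role of t.word_index
articles = [["the cat sat on the mat", "the dog ran"],
            ["a dog sat"]]
word_index = {"the": 1, "sat": 2, "dog": 3, "cat": 4, "on": 5, "mat": 6}

# Start from all zeros so unused positions act as padding
data = np.zeros((len(articles), MAX_SENTS, MAX_SENT_LENGTH), dtype="int32")

for i, sentences in enumerate(articles):
    for j, sentence in enumerate(sentences[:MAX_SENTS]):
        # Keep only in-vocabulary words, then truncate to the sentence length
        ids = [word_index[w] for w in re.findall(r"[a-z0-9]+", sentence.lower())
               if w in word_index]
        for k, word_id in enumerate(ids[:MAX_SENT_LENGTH]):
            data[i, j, k] = word_id

print(data.shape)
print(data[0])
```

Sentences longer than `MAX_SENT_LENGTH` are cut off, and missing sentences or words remain zero, so every article ends up with the same fixed shape.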

In [0]:
# Imports used in the rest of the notebook
import keras
from keras.models import Model, Sequential
from keras.layers import (Activation, Concatenate, Dense, Dropout, Embedding,
                          Flatten, Input, LSTM, Reshape)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


In [0]:
labels = pd.get_dummies(dataset['Stance'])
#keras.utils.to_categorical(dataset['Stance'], 10)

In [65]:
!pip install Merge

Collecting Merge
  Downloading https://files.pythonhosted.org/packages/77/b7/c39602bd3d03a98ec86f5e84f1f2f07b169e3623e183041786a88c962165/merge-1.0.0.zip
Building wheels for collected packages: Merge
  Building wheel for Merge (setup.py) ... done
  Created wheel for Merge: filename=merge-1.0.0-cp36-none-any.whl size=1493 sha256=550d83747d2da9e65228293e6bc2b03ece2d3cc4a462d654185d200f1436d5ce
  Stored in directory: /root/.cache/pip/wheels/6d/c7/7d/efe551f409cdd4572ece7ae7b9f96dacccae332fe2b1d386b3
Successfully built Merge
Installing collected packages: Merge
Successfully installed Merge-1.0.0


### Check 3:

Accessing the first element of `data` should give something like the output below.

In [0]:
data[0, :, :]

array([[    3,   487,   474,  7113,    79,     3,  3687,   325,     5,
         4200,   361,     4,  1525,  2913,     1,    89,    12,   451,
            0,     0],
       [  743,    96,  1044,     3,  2814,  1759,     7,   186,     3,
         1219,  1070,  1987,   736,   154,     1,  2990,   458,     1,
          543,   232],
       [   89,  1052,  4057,  2314,    12,     3,  1073,  3248,    19,
            1,    89,     2,  1751,     1,   518,  1980,    15,     9,
            3,  2879],
       [  182,  3691,   976,   196,  2515,    42,  6688,  1691,  1227,
            5, 13011, 17379,     1,   762,    30,   722,  3931,    66,
           87,     0],
       [ 2314,    12,  1882,    38,  1076,   346,   793,     2,   356,
          261,  1782,     5,  4396,    67,   486,     0,     0,     0,
            0,     0],
       [    1,   736,   186,    19,     1,   474,    32,     3,  7307,
            4,  2122,  1227,     6,     3,  5195,     4,  1219,  1227,
           12,  3308],
       [  

In [0]:
h = Tokenizer(num_words=MAX_NB_WORDS)

In [0]:
h.fit_on_texts(dataset['Headline'])

In [0]:
head_line =[]
for hdln in dataset['Headline']:
  head_line.append(hdln)

# Repeat the same process for the `Headings` as well. Use variables with names `texts_heading` and `articles_heading` accordingly. [5 marks] 

In [0]:
data_prep_headline = []
for headline in head_line:
  sent_20 = []
  # Keep at most MAX_SENTS sentences per headline (most headlines have only one)
  for sentence in nltk.sent_tokenize(headline)[:MAX_SENTS]:
    # Split the cleaned sentence into words and keep at most MAX_SENT_LENGTH of them
    words = text_prepare(sentence).split(" ")[:MAX_SENT_LENGTH]
    # Direct dictionary lookup in the headline vocabulary; unknown words are skipped
    text_index = [h.word_index[w] for w in words if w in h.word_index]
    sent_20.append(text_index)

  # Pad with all-zero sentences so every headline has MAX_SENTS rows
  while len(sent_20) < MAX_SENTS:
    sent_20.append([0] * MAX_SENT_LENGTH)
  x = sequence.pad_sequences(sent_20, maxlen=MAX_SENT_LENGTH, padding='post')
  data_prep_headline.append(x.tolist())
                   

In [0]:
head_lines = np.array(data_prep_headline)

### Now that the features are ready, let's make the labels ready for the model to process.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.
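
A minimal sketch of the one-hot step on made-up labels is shown below (the `stances` series is illustrative). `get_dummies` creates one column per distinct value that is actually present, in sorted order, which would explain a label shape of `(1000, 3)` if the 1000-row subset contains only three of the stances.

```python
import pandas as pd

stances = pd.Series(["unrelated", "agree", "discuss", "agree"], name="Stance")  # made-up labels

# One column per distinct stance, in sorted order; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(stances, dtype=int)
class_names = list(one_hot.columns)

print(class_names)
print(one_hot.to_numpy())

# Map a one-hot row (or a softmax output) back to its stance name
first_label = class_names[one_hot.to_numpy()[0].argmax()]
print(first_label)
```

Keeping `class_names` around is useful later, because the model's softmax output follows the same column order.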

In [71]:
print('Shape of data tensor:', head_lines.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (1000, 20, 20)
Shape of label tensor: (1000, 3)


### Check 4:

The shapes of the data and labels should match the numbers given below.

In [74]:
print('Shape of data tensor:', data_prep_arr.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (1000, 20, 20)
Shape of label tensor: (1000, 3)


### Shuffle the data

In [0]:
## get numbers upto no.of articles
indices = np.arange(data_prep_arr.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [0]:
labels = np.array(labels)

In [0]:
## shuffle the data
data = data_prep_arr[indices]
data_heading = head_lines[indices]
## shuffle the labels according to data
labels = labels[indices]
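
The point of shuffling through a shared index array is that bodies, headings, and labels stay aligned row by row. The sketch below illustrates this on made-up arrays (all names and values are hypothetical); any array that is not indexed with the same permutation would no longer line up with the shuffled labels.

```python
import numpy as np

# Made-up arrays: row i of each array belongs to the same example
bodies = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
headings = np.array([[1], [2], [3], [4]])
labels = np.array([0, 1, 0, 1])

rng = np.random.default_rng(seed=0)        # seeded only to make the run repeatable
indices = rng.permutation(len(bodies))     # one permutation shared by all arrays

bodies_s, headings_s, labels_s = bodies[indices], headings[indices], labels[indices]

# Every shuffled row still pairs the same body, heading, and label
for b, h, y in zip(bodies_s, headings_s, labels_s):
    assert b[0] == h[0] * 10 and y == (h[0] + 1) % 2
```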

In [0]:
type(head_lines)

numpy.ndarray

### Split into train and validation sets. Split the data in an 80:20 ratio to get the train and validation sets.


Use the variable names as given below:

x_train, x_val - for body of articles.

x_heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.



In [0]:
# Split the shuffled arrays together so that bodies, headings, and labels stay aligned
X_train, X_val, X_train_heading, X_val_heading, y_train, y_val = train_test_split(
    data, data_heading, labels, test_size=VALIDATION_SPLIT, random_state=1)

### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [79]:
print(X_train.shape)
print(X_val.shape)

print(y_train.shape)
print(y_val.shape)

(800, 20, 20)
(200, 20, 20)
(800, 3)
(200, 3)


In [80]:
print("Embedding for body")

Embedding for body


In [0]:
t = Tokenizer(num_words=MAX_NB_WORDS)
t.fit_on_texts(dataset['articleBody'])


In [0]:
dataset.columns

Index(['Headline', 'Body ID', 'Stance', 'articleBody'], dtype='object')

### Create embedding matrix with the glove embeddings


Run the code below to create the embedding matrix, which holds the GloVe vector for every vocabulary word that is present in the GloVe word list.

In [87]:
# Load the whole GloVe file into memory as a word -> vector dictionary
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# Create a weight matrix for the words in the article bodies
embedding_matrix_body = np.zeros((MAX_NB_WORDS, 100))

for word, i in t.word_index.items():
    # word_index is not capped by num_words, so skip indices outside the matrix
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix_body[i] = embedding_vector

Loaded 400000 word vectors.
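
For readers who want to see the lookup on something small, here is a toy version of the same idea. The three-dimensional vectors, the `word_index` mapping, and the names `EMBEDDING_DIM` and `MAX_WORDS` are made up for illustration; an in-memory string replaces the GloVe file.

```python
import io
import numpy as np

EMBEDDING_DIM = 3      # toy size; the GloVe file used in the notebook has 100 dimensions
MAX_WORDS = 4          # plays the role of MAX_NB_WORDS

# Made-up vectors in the GloVe text format: a word followed by its coefficients
glove_text = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
word_index = {"the": 1, "cat": 2, "zyx": 3, "rare": 7}   # hypothetical tokenizer output

embeddings_index = {}
for line in glove_text:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

embedding_matrix = np.zeros((MAX_WORDS, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_WORDS:
        continue                       # index would fall outside the matrix
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector   # words without a vector stay all-zero

print(embedding_matrix)
```

Row 0 stays zero because no word is assigned index 0, and row 3 stays zero because "zyx" has no pre-trained vector.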


In [0]:
print("Embedding for headline")

In [0]:
t = Tokenizer(num_words=MAX_NB_WORDS)
t.fit_on_texts(dataset['Headline'])

In [88]:
# Load the whole GloVe file into memory as a word -> vector dictionary
embeddings_index = dict()
with open('./glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

# Create a weight matrix for the words in the headlines
embedding_matrix_headline = np.zeros((MAX_NB_WORDS, 100))

for word, i in h.word_index.items():
    # word_index is not capped by num_words, so skip indices outside the matrix
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix_headline[i] = embedding_vector

Loaded 400000 word vectors.


In [0]:
# from keras.layers import Embedding

# embedding_layer = Embedding(MAX_NB_WORDS + 1,
#                             100,
#                             weights=[embedding_matrix_body],
#                             input_length=MAX_SENT_LENGTH,
#                             trainable=False)

In [0]:
# sequence_input = Input(shape=(20,), dtype='int32')

In [0]:
# embedded_sequences = embedding_layer(sequence_input)

In [0]:
# x = LSTM(10)(embedded_sequences)

In [0]:
# x = Dense(1)(x)

In [0]:
# from keras.layers import Embedding

# embedding_layer1 = Embedding(MAX_NB_WORDS + 1,
#                             100,
#                             weights=[embedding_matrix_headline],
#                             input_length=MAX_SENT_LENGTH,
#                             trainable=False)

In [0]:
# embedded_sequences1 = embedding_layer1(sequence_input)

In [0]:
# y = LSTM(10)(embedded_sequences1)

In [0]:
# y = Dense(1)(y)

In [0]:
# w = concatenate([x, y])

# # u =  Dense(3)(w)
# out =  Dense(1, activation='softmax')(w)

In [0]:
# model = Model(sequence_input, out)
# model.compile(loss='categorical_crossentropy',
#               optimizer='rmsprop',
#               metrics=['acc'])

# # happy learning!
# model.fit([X_train,X_train_heading], y_train, validation_data=0.2,
#           epochs=2, batch_size=32)

# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

In [0]:
from keras.layers import TimeDistributed, Bidirectional,concatenate


### Model

In [0]:
                  
# headline_model = Sequential()
# headline_model.add(Embedding(MAX_NB_WORDS,###nedd to handle index between 0 and 10000
#                     100,###50 embedding
#                     input_length=MAX_SENT_LENGTH,##300 max lenght we got above
#                     weights=[embedding_matrix_headline], #Pre-trained embedding
#                     trainable=False) #We do not want to change embedding
#                    )
# headline_model.add(LSTM(5, return_sequences=False, dropout=0.1,recurrent_dropout=0.1))


In [89]:
sentance_input = Input(shape = (MAX_SENT_LENGTH,),dtype ='int32')
print(sentance_input)
embedded_sequences = Embedding(output_dim = 100,input_dim = MAX_NB_WORDS,input_length=(MAX_SENT_LENGTH,),weights=[embedding_matrix_body])(sentance_input)

Tensor("input_2:0", shape=(?, 20), dtype=int32)


In [0]:
l_lstm = Bidirectional(LSTM(100,return_sequences =True))(embedded_sequences)
l_dense = TimeDistributed(Dense(100))(l_lstm)
l_dense = Flatten()(l_dense)
sentEncoder = Model(sentance_input,l_dense)

In [93]:
body_input = Input(shape = (MAX_SENTS,MAX_SENT_LENGTH,),dtype ='int32')
print(body_input)
body_encoder = TimeDistributed(sentEncoder)(body_input)
print(body_encoder)
l_lstm_sent = Bidirectional(LSTM(100,return_sequences=True))(body_encoder)
l_dense_sent = TimeDistributed(Dense(100))(l_lstm_sent)
l_dense_sent = Flatten()(l_dense_sent)

Tensor("input_3:0", shape=(?, 20, 20), dtype=int32)
Tensor("time_distributed_2/Reshape_1:0", shape=(?, 20, 2000), dtype=float32)


In [105]:
heading_input = Input(shape = (MAX_SENTS_HEADING,MAX_SENT_LENGTH,),dtype ='int32')
print(heading_input)
heading_embedded_sequences = Embedding(output_dim = 100,input_dim = MAX_NB_WORDS,input_length=(MAX_SENTS_HEADING,MAX_SENT_LENGTH),weights=[embedding_matrix_headline])(heading_input)
print(body_encoder)
h_dense = Dense(100,activation='relu')(heading_embedded_sequences)
h_flatten = Flatten()(h_dense)
article_output = concatenate([l_dense_sent,h_flatten],name = 'concatenate_heading')
           
news_vector = Dense(4,activation='relu')(article_output)
preds = Dense(3,activation='softmax')(news_vector)
model = Model([body_input,heading_input],[preds])

Tensor("input_8:0", shape=(?, 1, 20), dtype=int32)
Tensor("time_distributed_2/Reshape_1:0", shape=(?, 20, 2000), dtype=float32)


In [0]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

In [108]:
model.fit([X_train,X_train_heading],[y_train],epochs = 10,batch_size = 50)

ValueError: ignored
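
The traceback is not shown, so the cause can only be inferred. One plausible reading is a shape mismatch: `heading_input` is declared with shape `(MAX_SENTS_HEADING, MAX_SENT_LENGTH)`, that is `(1, 20)`, while the heading array prepared earlier has shape `(20, 20)` per example. Under that assumption, keeping only the first sentence slot of each heading would make the shapes agree. The sketch below uses a zero-filled stand-in array (`fake_headings` is hypothetical) rather than the real data.

```python
import numpy as np

MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20

# Stand-in with the same per-example shape as the heading array in the notebook
fake_headings = np.zeros((800, 20, MAX_SENT_LENGTH), dtype="int32")

# Keep only the first MAX_SENTS_HEADING sentence slots of every heading
headings_for_model = fake_headings[:, :MAX_SENTS_HEADING, :]

expected_input_shape = (MAX_SENTS_HEADING, MAX_SENT_LENGTH)
print(headings_for_model.shape)
print(headings_for_model.shape[1:] == expected_input_shape)
```

If this diagnosis holds, passing `X_train_heading[:, :MAX_SENTS_HEADING, :]` to `model.fit` would be the corresponding change; whether the model then trains was not verified here.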

In [110]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 20, 20)       0                                            
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, 20, 2000)     2180900     input_3[0][0]                    
__________________________________________________________________________________________________
input_8 (InputLayer)            (None, 1, 20)        0                                            
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, 20, 200)      1680800     time_distributed_2[0][0]         
__________________________________________________________________________________________________
embedding_



### Compile and fit the model

## Build the same model with attention layers included for better performance (Optional)

## Fit the model and report the accuracy score for the model with attention layer (Optional)