# Stance Detection for the Fake News Challenge

## Identifying Textual Relationships with Deep Neural Nets

### Check the problem context [here](https://drive.google.com/open?id=1KfWaZyQdGBw8AUTacJ2yY86Yxgw2Xwq0).

### Download files required for the project from [here](https://drive.google.com/open?id=10yf39ifEwVihw4xeJJR60oeFBY30Y5J8).

## Step1: Load the given dataset  

1. Mount the google drive

2. Import Glove embeddings

3. Import the test and train datasets

### Mount the google drive to access required project files

Run the below commands

In [0]:
from google.colab import drive

In [2]:
drive.mount('/content/drive/')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive/


#### Path for Project files on google drive

**Note:** You need to change this path according where you have kept the files in google drive. 

In [0]:
project_path = "/content/drive/My Drive/Colab Notebooks/Sequential NLP/Fake News Challenge/"

### Loading the Glove Embeddings

In [19]:
from zipfile import ZipFile
with ZipFile(project_path+'glove.6B.zip', 'r') as z:
  z.extractall()

OSError: ignored

# Load the dataset [5 Marks]

1. Using [read_csv()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) in pandas load the given train datasets files **`train_bodies.csv`** and **`train_stances.csv`**

2. Using [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) command in pandas merge the two datasets based on the Body ID. 

Note: Save the final merged dataset in a dataframe with name **`dataset`**.

In [0]:
import pandas as pd

In [0]:
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/Sequential NLP/Fake News Challenge/')

In [0]:
df1=pd.read_csv('train_bodies.csv')

In [9]:
df1.head()

Unnamed: 0,Body ID,articleBody
0,0,A small meteorite crashed into a wooded area i...
1,4,Last week we hinted at what was to come as Ebo...
2,5,(NEWSER) – Wonder how long a Quarter Pounder w...
3,6,"Posting photos of a gun-toting child online, I..."
4,7,At least 25 suspected Boko Haram insurgents we...


In [10]:
df1.articleBody[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [11]:
df1.count

<bound method DataFrame.count of       Body ID                                        articleBody
0           0  A small meteorite crashed into a wooded area i...
1           4  Last week we hinted at what was to come as Ebo...
2           5  (NEWSER) – Wonder how long a Quarter Pounder w...
3           6  Posting photos of a gun-toting child online, I...
4           7  At least 25 suspected Boko Haram insurgents we...
...       ...                                                ...
1678     2528  Intelligence agencies hunting for identity of ...
1679     2529  While Daleks "know no fear" and "must not fear...
1680     2530  More than 200 schoolgirls were kidnapped in Ap...
1681     2531  A Guantanamo Bay prisoner released last year a...
1682     2532  ANN ARBOR, Mich. – A pizza delivery man in Mic...

[1683 rows x 2 columns]>

In [0]:
df2=pd.read_csv('train_stances.csv')

In [0]:
df=df1.merge(df2, how='outer',on='Body ID')


<h2> Check1:</h2>
  
<h3> You should see the below output if you run `dataset.head()` command as given below </h3>

In [14]:
df.head()

Unnamed: 0,Body ID,articleBody,Headline,Stance
0,0,A small meteorite crashed into a wooded area i...,"Soldier shot, Parliament locked down after gun...",unrelated
1,0,A small meteorite crashed into a wooded area i...,Tourist dubbed ‘Spider Man’ after spider burro...,unrelated
2,0,A small meteorite crashed into a wooded area i...,Luke Somers 'killed in failed rescue attempt i...,unrelated
3,0,A small meteorite crashed into a wooded area i...,BREAKING: Soldier shot at War Memorial in Ottawa,unrelated
4,0,A small meteorite crashed into a wooded area i...,Giant 8ft 9in catfish weighing 19 stone caught...,unrelated


In [15]:
df['articleBody']

0        A small meteorite crashed into a wooded area i...
1        A small meteorite crashed into a wooded area i...
2        A small meteorite crashed into a wooded area i...
3        A small meteorite crashed into a wooded area i...
4        A small meteorite crashed into a wooded area i...
                               ...                        
49967    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49968    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49969    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49970    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49971    ANN ARBOR, Mich. – A pizza delivery man in Mic...
Name: articleBody, Length: 49972, dtype: object

## Step2: Data Pre-processing and setting some hyper parameters needed for model


#### Run the code given below to set the required parameters.

1. `MAX_SENTS` = Maximum no.of sentences to consider in an article.

2. `MAX_SENT_LENGTH` = Maximum no.of words to consider in a sentence.

3. `MAX_NB_WORDS` = Maximum no.of words in the total vocabualry.

4. `MAX_SENTS_HEADING` = Maximum no.of sentences to consider in a heading of an article.

In [0]:
MAX_NB_WORDS = 20000
MAX_SENTS = 20
MAX_SENTS_HEADING = 1
MAX_SENT_LENGTH = 20
VALIDATION_SPLIT = 0.2

### Download the `Punkt` from nltk using the commands given below. This is for sentence tokenization.

For more info on how to use it, read [this](https://stackoverflow.com/questions/35275001/use-of-punktsentencetokenizer-in-nltk).



In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Tokenizing the text and loading the pre-trained Glove word embeddings for each token  [5 marks] 

Keras provides [Tokenizer API](https://keras.io/preprocessing/text/) for preparing text. Read it before going any further.

#### Import the Tokenizer from keras preprocessing text

In [18]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [0]:
import tensorflow as tf
from tensorflow import keras

#### Initialize the Tokenizer class with maximum vocabulary count as `MAX_NB_WORDS` initialized at the start of step2. 

In [0]:
tokenizer=keras.preprocessing.text.Tokenizer(num_words=MAX_NB_WORDS)

#### Now, using fit_on_texts() from Tokenizer class, lets encode the data 

Note: We need to fit articleBody and Headline also to cover all the words.

In [0]:
articles=tokenizer.fit_on_texts(df['articleBody'])

In [23]:
tokenizer.word_index

{'the': 1,
 'to': 2,
 'a': 3,
 'of': 4,
 'in': 5,
 'and': 6,
 'that': 7,
 'is': 8,
 'was': 9,
 'on': 10,
 'for': 11,
 'said': 12,
 'he': 13,
 'with': 14,
 'it': 15,
 'his': 16,
 'have': 17,
 'as': 18,
 'by': 19,
 'has': 20,
 'from': 21,
 'at': 22,
 'be': 23,
 'an': 24,
 'not': 25,
 'are': 26,
 'been': 27,
 '”': 28,
 'but': 29,
 'this': 30,
 'had': 31,
 'who': 32,
 'they': 33,
 'after': 34,
 'i': 35,
 'were': 36,
 'we': 37,
 'will': 38,
 'about': 39,
 'one': 40,
 'or': 41,
 'which': 42,
 'she': 43,
 'video': 44,
 'apple': 45,
 'up': 46,
 'would': 47,
 'her': 48,
 'state': 49,
 'their': 50,
 'also': 51,
 'more': 52,
 'when': 53,
 'told': 54,
 'out': 55,
 'isis': 56,
 'all': 57,
 'no': 58,
 'new': 59,
 'people': 60,
 'there': 61,
 'you': 62,
 'its': 63,
 'if': 64,
 'him': 65,
 'news': 66,
 'what': 67,
 'could': 68,
 'man': 69,
 'year': 70,
 'islamic': 71,
 'time': 72,
 'some': 73,
 'al': 74,
 'according': 75,
 'watch': 76,
 'over': 77,
 'group': 78,
 'into': 79,
 'so': 80,
 'first': 81,
 

In [0]:
headlines=tokenizer.fit_on_texts(df['Headline'])

In [0]:
wordindex=tokenizer.word_index

In [26]:
len(wordindex)

27873

#### fit_on_texts() gives the following attributes in the output as given [here](https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/).

* **word_counts:** dictionary mapping words (str) to the number of times they appeared on during fit. Only set after fit_on_texts was called.

* **word_docs:** dictionary mapping words (str) to the number of documents/texts they appeared on during fit. Only set after fit_on_texts was called.

* **word_index:** dictionary mapping words (str) to their rank/index (int). Only set after fit_on_texts was called.

* **document_count:** int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences was called.



### Now, tokenize the sentences using nltk sent_tokenize() and encode the senteces with the ids we got form the above `t.word_index`

Initialise 2 lists with names `texts` and `articles`.

```
texts = [] to store text of article as it is.

articles = [] split the above text into a list of sentences.
```

In [0]:
from nltk.tokenize import sent_tokenize

In [0]:
texts=[]

In [0]:
texts = df['articleBody']


In [30]:
texts.count

<bound method Series.count of 0        A small meteorite crashed into a wooded area i...
1        A small meteorite crashed into a wooded area i...
2        A small meteorite crashed into a wooded area i...
3        A small meteorite crashed into a wooded area i...
4        A small meteorite crashed into a wooded area i...
                               ...                        
49967    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49968    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49969    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49970    ANN ARBOR, Mich. – A pizza delivery man in Mic...
49971    ANN ARBOR, Mich. – A pizza delivery man in Mic...
Name: articleBody, Length: 49972, dtype: object>

In [31]:
texts[100]

'At least 25 suspected Boko Haram insurgents were killed in clashes between soldiers and the Islamist militants in northeast Nigeria and five civilians were killed in fighting elsewhere in the region, a military source and residents said on Monday.\n\nA ceasefire agreement between Boko Haram and the Nigerian government was expected to lead to the liberation of more than 200 schoolgirls kidnapped by the militants six months ago, and talks were due to continue in neighbouring Chad on Monday.\n\nBoko Haram has not confirmed the truce and there have been at least six attacks over the weekend – blamed by security sources on the insurgents – that have killed several dozen people since the announcement of the ceasefire.\n\nA government spokesman has said that the fighting on Sunday may be the work of criminal gangs in the lawless region.\n\nAn army officer, who requested anonymity, said the militants tried to enter the town of Damboa late on Sunday through Alagarno, a Boko Haram hideout, but 

In [0]:
sents=[]

In [0]:
sents = texts.apply(sent_tokenize)

In [34]:
sents[100]

['At least 25 suspected Boko Haram insurgents were killed in clashes between soldiers and the Islamist militants in northeast Nigeria and five civilians were killed in fighting elsewhere in the region, a military source and residents said on Monday.',
 'A ceasefire agreement between Boko Haram and the Nigerian government was expected to lead to the liberation of more than 200 schoolgirls kidnapped by the militants six months ago, and talks were due to continue in neighbouring Chad on Monday.',
 'Boko Haram has not confirmed the truce and there have been at least six attacks over the weekend – blamed by security sources on the insurgents – that have killed several dozen people since the announcement of the ceasefire.',
 'A government spokesman has said that the fighting on Sunday may be the work of criminal gangs in the lawless region.',
 'An army officer, who requested anonymity, said the militants tried to enter the town of Damboa late on Sunday through Alagarno, a Boko Haram hideout,

## Check 2:

first element of texts and articles should be as given below. 

In [35]:
texts[0]

'A small meteorite crashed into a wooded area in Nicaragua\'s capital of Managua overnight, the government said Sunday. Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city\'s airport, the Associated Press reports. \n\nGovernment spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth." House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports. \nMurillo said Nicaragua will ask international experts to help local scientists in understanding what happened.\n\nThe crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee. He said it is still not clear if the meteorite disintegrated or was buried.\n\nHumbe

In [36]:
sents[1]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

In [37]:
sents.count

<bound method Series.count of 0        [A small meteorite crashed into a wooded area ...
1        [A small meteorite crashed into a wooded area ...
2        [A small meteorite crashed into a wooded area ...
3        [A small meteorite crashed into a wooded area ...
4        [A small meteorite crashed into a wooded area ...
                               ...                        
49967    [ANN ARBOR, Mich. – A pizza delivery man in Mi...
49968    [ANN ARBOR, Mich. – A pizza delivery man in Mi...
49969    [ANN ARBOR, Mich. – A pizza delivery man in Mi...
49970    [ANN ARBOR, Mich. – A pizza delivery man in Mi...
49971    [ANN ARBOR, Mich. – A pizza delivery man in Mi...
Name: articleBody, Length: 49972, dtype: object>

# Now iterate through each article and each sentence to encode the words into ids using t.word_index  [5 marks] 

Here, to get words from sentence you can use `text_to_word_sequence` from keras preprocessing text.

1. Import text_to_word_sequence

2. Initialize a variable of shape (no.of articles, MAX_SENTS, MAX_SENT_LENGTH) with name `data` with zeros first (you can use numpy [np.zeros](https://docs.scipy.org/doc/numpy/reference/generated/numpy.zeros.html) to initialize with all zeros)and then update it while iterating through the words and sentences in each article.

In [0]:
from keras.preprocessing.text import text_to_word_sequence

In [0]:
import array 
import re

In [0]:
import numpy as np
data=np.zeros((49972,1000,1000))

In [0]:
keras.preprocessing.text.text_to_word_sequence(sents)


In [55]:
len(sents)

49972

In [59]:
sents[1]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.",
 "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.",
 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."',
 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.',
 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.',
 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.',
 'He said it is still not clear if the meteorite disintegrated or was bu

In [1]:
for i,sublist_sentence in enumerate(sents[0:2]):
  #print(sublist_sentence)
  for j,sentence in enumerate(sublist_sentence):
    #print(sentence.strip())
    #if len(sentence.strip()) < 20:
    words = text_to_word_sequence(str(sentence))
    print(words)
    for k,w in enumerate(words):
      #if len(w) < 20:
      data[i,j,k] = wordindex[w]

NameError: ignored

In [77]:
for i in sents[0:2]:
  #print(i)
  for j in i:
    #print('This is Sentence:',j)
    print('Length :',len(j.strip()))

Length : 117
Length : 131
Length : 225
Length : 115
Length : 110
Length : 200
Length : 75
Length : 199
Length : 68
Length : 135
Length : 51
Length : 114
Length : 89
Length : 104
Length : 85
Length : 59
Length : 117
Length : 131
Length : 225
Length : 115
Length : 110
Length : 200
Length : 75
Length : 199
Length : 68
Length : 135
Length : 51
Length : 114
Length : 89
Length : 104
Length : 85
Length : 59


In [64]:
for sentence in sents[:2]:
  print(sentence)
  if len(sentence.strip()) < 20:
    # sent = sentence.strip()
    words = text_to_word_sequence(str(sentence))
    for w in words:
      if len(w) < 20:
        if w in wordindex:#tokenizer.word_index:
          data[i, j, k] = wordindex[w] #tokenizer.word_index[w]
  #   for j,word_ in enumerate(words):
  #     if len(j) < 20:

  # #short_sents = re.split(sentence)
  # for j, sent in enumerate(str(sentence)):
  #   print(j,sent)
  #   #if len(sent) < 20:
  #   if j < 20 and sent.strip():
  #     words = text_to_word_sequence(str(sentence))
  #     k = 0
  #     for w in words:
  #       if k < 20:
  #         if w in wordindex:#tokenizer.word_index:
  #           data[i, j, k] = wordindex[w] #tokenizer.word_index[w]

["A small meteorite crashed into a wooded area in Nicaragua's capital of Managua overnight, the government said Sunday.", "Residents reported hearing a mysterious boom that left a 16-foot deep crater near the city's airport, the Associated Press reports.", 'Government spokeswoman Rosario Murillo said a committee formed by the government to study the event determined it was a "relatively small" meteorite that "appears to have come off an asteroid that was passing close to Earth."', 'House-sized asteroid 2014 RC, which measured 60 feet in diameter, skimmed the Earth this weekend, ABC News reports.', 'Murillo said Nicaragua will ask international experts to help local scientists in understanding what happened.', 'The crater left by the meteorite had a radius of 39 feet and a depth of 16 feet,  said Humberto Saballos, a volcanologist with the Nicaraguan Institute of Territorial Studies who was on the committee.', 'He said it is still not clear if the meteorite disintegrated or was buried.'

AttributeError: ignored

In [66]:
data[0, :, :]

array([[178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [178.,   0.,   0.,   0.,   0.,   0

In [0]:
data.shape

(49972, 20, 20)

### Check 3:

Accessing first element in data should give something like given below.

In [0]:
data[0, :, :]

array([[    3,   487,   474,  7113,    79,     3,  3687,   325,     5,
         4200,   361,     4,  1525,  2913,     1,    89,    12,   451,
            0,     0],
       [  743,    96,  1044,     3,  2814,  1759,     7,   186,     3,
         1219,  1070,  1987,   736,   154,     1,  2990,   458,     1,
          543,   232],
       [   89,  1052,  4057,  2314,    12,     3,  1073,  3248,    19,
            1,    89,     2,  1751,     1,   518,  1980,    15,     9,
            3,  2879],
       [  182,  3691,   976,   196,  2515,    42,  6688,  1691,  1227,
            5, 13011, 17379,     1,   762,    30,   722,  3931,    66,
           87,     0],
       [ 2314,    12,  1882,    38,  1076,   346,   793,     2,   356,
          261,  1782,     5,  4396,    67,   486,     0,     0,     0,
            0,     0],
       [    1,   736,   186,    19,     1,   474,    32,     3,  7307,
            4,  2122,  1227,     6,     3,  5195,     4,  1219,  1227,
           12,  3308],
       [  

# Repeat the same process for the `Headings` as well. Use variables with names `texts_heading` and `articles_heading` accordingly. [5 marks] 

texts = [] to store text of article as it is.
 
articles = [] split the above text into a list of sentences.

In [0]:
texts_heading=[]
articles_heading=[]

In [0]:
tokenizer2=keras.preprocessing.text.Tokenizer(num_words=MAX_NB_WORDS)

In [0]:
headlines=tokenizer2.fit_on_texts(df['Headline'])

In [0]:
wordindex2=tokenizer2.word_index

In [0]:
len(wordindex2)

3879

In [0]:
texts_heading = df['Headline']


In [0]:
texts_heading.count

<bound method Series.count of 0        Soldier shot, Parliament locked down after gun...
1        Tourist dubbed ‘Spider Man’ after spider burro...
2        Luke Somers 'killed in failed rescue attempt i...
3         BREAKING: Soldier shot at War Memorial in Ottawa
4        Giant 8ft 9in catfish weighing 19 stone caught...
                               ...                        
49967    Pizza delivery man gets tipped more than $2,00...
49968                   Pizza delivery man gets $2,000 tip
49969     Luckiest Pizza Delivery Guy Ever Gets $2,000 Tip
49970    Ann Arbor pizza delivery driver surprised with...
49971    Ann Arbor pizza delivery driver surprised with...
Name: Headline, Length: 49972, dtype: object>

In [0]:
heading_sents = texts_heading.apply(sent_tokenize)

In [0]:
heading_data=np.zeros((49972,MAX_SENTS,MAX_SENT_LENGTH))

In [0]:
for i, sentence in enumerate(heading_sents):
  #short_sents = re.split(sentence)
  for j, sent in enumerate(str(sentence)):
    if j < 20 and sent.strip():
      words = text_to_word_sequence(str(sentence))
      k = 0
      for w in words:
        if k < 20:
          if w in wordindex2:#tokenizer.word_index:
            heading_data[i, j, k] = wordindex2[w] #tokenizer.word_index[w]

In [0]:
heading_data[0,:,:]

array([[176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
          0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.],
       [176.,   0.,   0.,   0.,   0.,   0

### Now the features are ready, lets make the labels ready for the model to process.

### Convert labels into one-hot vectors

You can use [get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) in pandas to create one-hot vectors.

In [0]:
labels=df['Stance']

In [0]:
labels=pd.get_dummies(labels)

In [0]:
labels.shape

(49972, 4)

In [0]:
labels.head()

Unnamed: 0,agree,disagree,discuss,unrelated
0,0,0,0,1
1,0,0,0,1
2,0,0,0,1
3,0,0,0,1
4,0,0,0,1


### Check 4:

The shape of data and labels shoould match the given below numbers.

In [0]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (49972, 20, 20)
Shape of label tensor: (49972, 4)


### Shuffle the data

In [0]:
## get numbers upto no.of articles
indices = np.arange(data.shape[0])
## shuffle the numbers
np.random.shuffle(indices)

In [0]:
## shuffle the data
data = data[indices]
heading_data = heading_data[indices]
## shuffle the labels according to data
labels = labels[indices]

KeyError: ignored

### Split into train and validation sets. Split the train set 80:20 ratio to get the train and validation sets.


Use the variable names as given below:

x_train, x_val - for body of articles.

x-heading_train, x_heading_val - for heading of articles.

y_train - for training labels.

y_val - for validation labels.



In [0]:
from sklearn.model_selection import train_test_split

In [0]:
x_train,x_val,y_train,y_val = train_test_split(data,labels,test_size = 0.2)

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39977, 20, 20)
(39977, 4)
(9995, 20, 20)
(9995, 4)


### Check 5:

The shape of x_train, x_val, y_train and y_val should match the below numbers.

In [0]:
print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)

(39978, 20, 20)
(39978, 4)
(9994, 20, 20)
(9994, 4)


### Create embedding matrix with the glove embeddings


Run the below code to create embedding_matrix which has all the words and their glove embedding if present in glove word list.

In [0]:
os.chdir('/content/drive/My Drive/Colab Notebooks/Sequential NLP/')

In [0]:
# load the whole embedding into memory
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
	values = line.split()
	word = values[0]
	coefs = np.asarray(values[1:], dtype='float32')
	embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

# create a weight matrix for words in training docs
embedding_matrix = np.zeros((27427, 100))


for word, i in wordindex.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector

Loaded 400000 word vectors.


# Try the sequential model approach and report the accuracy score. [10 marks]  

### Import layers from Keras to build the model

In [0]:
import tensorflow as tf

tf.keras.backend.clear_session()
model = tf.keras.Sequential()

### Model

In [0]:
model.add(tf.keras.layers.Embedding(27427 + 1, #Vocablury size
                                    50, #Embedding size
                                    input_length=20) #Number of words in each review
          )

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.


In [0]:
model.add(tf.keras.layers.LSTM(256, #RNN State - size of cell state and hidden state
                               dropout=0.2, #Dropout before feeding the data to LSTM layer
                               recurrent_dropout=0.2)) #Dropout applied to the output of LSTM layer

### Compile and fit the model

In [0]:
model.add(tf.keras.layers.Dense(1,activation='sigmoid'))

In [0]:
from keras.losses import categorical_crossentropy
from keras.optimizers import Adam
optimizer = Adam(lr=0.001)

In [0]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [0]:
model.fit(x_train,y_train,
          epochs=20,
          batch_size=32,          
          validation_data=(x_val, y_val))

## Build the same model with attention layers included for better performance (Optional)

# **Note , Due to Session of colab is getting crashed again and agian, I am uploading file as it is . Though code is correct, I couldn't show all the outputs. **

## Fit the model and report the accuracy score for the model with attention layer (Optional)