# Natural Language Processing for Policy Making

## Description of Data:
The data consists of the text of special mentions made in the Rajya Sabha from session 194 to session 243 (the most recent). A special mention is a kind of question or demand, and the ministry to which it is addressed is liable to reply to it. Since some mentions go unreplied, we attempt to predict which mentions from the most recent session are likely to go unreplied, based on the trend in previous sessions.

### We have 3 files of data:
1. 'new_pos.txt': special mentions which got replies
2. 'new_neg.txt': special mentions which did not get replies
3. 'predict-on.txt': special mentions of the recent session

### 1. Let's import the data:

In [1]:
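# Read each file into a list of lines (splitlines() yields one entry per line).
with open('new_pos.txt','r') as a, open('new_neg.txt','r') as b, open('predict-on.txt','r') as c:
    contentspos = a.read().splitlines()  # mentions that got replies
    contentsneg = b.read().splitlines()  # mentions that did not get replies
    contentspre = c.read().splitlines()  # mentions from the recent session
# The with block closes all three files on exit, so no explicit close() is needed.
contents = contentspos + contentsneg + contentspre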
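
If the files contain blank lines or non-ASCII characters, a plain `read().splitlines()` can yield empty mentions or depend on the platform's default encoding. Below is a minimal sketch of a more defensive loader. The helper name `read_mentions`, the UTF-8 encoding, and the demo lines are assumptions made for illustration, not part of the original notebook.

```python
import os
import tempfile

def read_mentions(path, encoding="utf-8"):
    """Return the non-empty lines of a text file, stripped of surrounding whitespace.

    Hypothetical helper: assumes one special mention per line and UTF-8 text.
    """
    with open(path, "r", encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Self-contained demonstration on a temporary file with made-up lines
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("need for a new railway line\n\n  demand to release flood relief funds  \n")
    demo_mentions = read_mentions(demo_path)

print(demo_mentions)
```

Dropping empty lines before tokenizing keeps all-zero sequences out of the training data.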

### 2. Let's build a tokenizer/dictionary from the data using Keras:

In [2]:
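from keras.preprocessing.text import Tokenizer
# Word-level tokenizer: lowercases the text, splits on spaces, keeps every word.
t = Tokenizer(num_words=None, lower=True, split=' ', char_level=False, oov_token=None)
t.fit_on_texts(contents)  # build the word index from all three files
# Word indices start at 1 (0 is reserved for padding), so the largest index equals len(t.word_index).
vocab_size = len(t.word_index)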

Using TensorFlow backend.
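
To make the dictionary-building step concrete, here is a simplified pure-Python stand-in for what `fit_on_texts` does: count the words and give the most frequent ones the smallest indices, starting at 1. The function name `build_word_index` and the toy sentences are illustrative assumptions, and the sketch omits the punctuation filtering that the real `Tokenizer` applies by default.

```python
from collections import Counter

def build_word_index(texts):
    """Simplified stand-in for Tokenizer.fit_on_texts (illustrative only)."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on whitespace, as lower=True and split=' ' do
        counts.update(text.lower().split())
    # most_common() orders words by descending frequency; indices start at 1
    # because 0 is left free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Made-up example texts
toy_texts = ["release funds for flood relief", "flood relief for farmers"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```

Note that the largest index equals the number of distinct words, which matters when sizing an embedding layer on top of these indices.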


### 3. Let's convert the given text to tokens using the dictionary:

In [3]:
tokenized_docs = t.texts_to_sequences(contents)
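
Conceptually, this step replaces each word with its integer index. The sketch below shows the idea in plain Python. The helper `to_sequence` and the toy word index are hypothetical, and words missing from the index are simply dropped, which mirrors the behaviour when no `oov_token` is set.

```python
def to_sequence(text, word_index):
    """Map a text to a list of word indices, skipping unknown words (illustrative only)."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

# Hypothetical word index for demonstration
toy_word_index = {"flood": 1, "relief": 2, "funds": 3}
toy_sequence = to_sequence("Flood relief funds delayed", toy_word_index)
print(toy_sequence)  # [1, 2, 3] -- "delayed" is not in the index and is dropped
```

In this notebook the tokenizer was fitted on all three files, so every word in the data has an index and nothing is dropped.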

### 4. Let's pad the sequences to the length of the longest mention:

In [4]:
from keras.preprocessing.sequence import pad_sequences
# Length of the longest tokenized mention
max_len = max(len(doc) for doc in tokenized_docs)
# Left-pad every sequence with zeros up to max_len
padded_docs = pad_sequences(tokenized_docs,maxlen=max_len,dtype='int32',padding='pre',truncating='pre',value=0)
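
The effect of `padding='pre'` and `truncating='pre'` is easiest to see on a small example. The following is a pure-Python sketch of that behaviour, not the Keras implementation; `pad_pre` is a hypothetical helper and assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length (illustrative only)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # 'pre' truncation keeps the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # 'pre' padding adds zeros on the left
    return padded

toy_padded = pad_pre([[1, 2, 3], [4], [5, 6, 7, 8]], maxlen=3)
print(toy_padded)  # [[1, 2, 3], [0, 0, 4], [6, 7, 8]]
```

Padding on the left keeps the actual words at the end of each sequence, closest to the LSTM's final step.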

### 5. Let's shuffle the dataset (excluding the texts to be predicted) and divide it into train/test sets:

In [6]:
import random
import numpy as np
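# Label replied mentions 1 and unreplied mentions 0
labels = list(np.ones(len(contentspos))) + list(np.zeros(len(contentsneg)))
# Pair each padded sequence with its label; the slice leaves out the mentions to be predicted
for_shuffling = list(zip(padded_docs[:len(contentspos)+len(contentsneg)],labels)) #gives us a list of tuples
random.shuffle(for_shuffling)
test_size = int(0.2*len(for_shuffling))  # hold out 20% as the test set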
X = [e[0] for e in for_shuffling]
y = [e[1] for e in for_shuffling]
X_train = X[:-test_size]
y_train = y[:-test_size]
X_test = X[-test_size:]
y_test = y[-test_size:]
X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)
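
Because `random.shuffle` is unseeded here, each run produces a different split. A minimal sketch of a reproducible variant is shown below. The function name `shuffled_split` and the `seed` parameter are assumptions for illustration. It also slices with explicit indices, since `X[:-test_size]` would return an empty list if `test_size` were 0.

```python
import random

def shuffled_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle (sample, label) pairs with a fixed seed and split off a test set (illustrative only)."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)  # private generator, does not touch global random state
    test_size = int(test_fraction * len(pairs))
    cut = len(pairs) - test_size
    train, test = pairs[:cut], pairs[cut:]
    X_tr = [s for s, _ in train]
    y_tr = [l for _, l in train]
    X_te = [s for s, _ in test]
    y_te = [l for _, l in test]
    return X_tr, y_tr, X_te, y_te

# Toy data: ten samples whose label is the parity of the sample id
toy_samples = list(range(10))
toy_labels = [s % 2 for s in toy_samples]
X_tr, y_tr, X_te, y_te = shuffled_split(toy_samples, toy_labels, seed=42)
print(len(X_tr), len(X_te))  # 8 2
```

The resulting lists can be converted with `np.array` exactly as in the cell above.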

### 6. Let's build the model now:

In [7]:
from keras.models import Sequential
from keras.layers.embeddings import Embedding
from keras.layers import Dense, LSTM
from keras.optimizers import Adam
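model = Sequential()
# Map each word index to a 256-dimensional vector
model.add(Embedding(vocab_size,256,input_length=max_len))
# Two stacked LSTMs: the first returns the full sequence, the second only its final output
model.add(LSTM(units=32, return_sequences=True))
model.add(LSTM(units=8, return_sequences=False))
# Sigmoid output: probability that a mention gets a reply (label 1)
model.add(Dense(1,activation='sigmoid'))
optimizer = Adam(lr=1e-3)
model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])
model.summary()
# The held-out test set is also used as validation data during training
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=50,batch_size=32,verbose=2)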

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 36, 256)           1486848   
_________________________________________________________________
lstm_1 (LSTM)                (None, 36, 32)            36992     
_________________________________________________________________
lstm_2 (LSTM)                (None, 8)                 1312      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 9         
Total params: 1,525,161
Trainable params: 1,525,161
Non-trainable params: 0
_________________________________________________________________
Train on 2583 samples, validate on 645 samples
Epoch 1/50
 - 8s - loss: 0.6556 - acc: 0.6009 - val_loss: 0.6450 - val_acc: 0.6124
Epoch 2/50
 - 7s - loss: 0.5283 - acc: 0.7522 - val_loss: 0.6521 - val_acc: 0.6589
Epoch 3/50
 - 7s - loss: 0.3318 - acc: 0.8815 - val

<keras.callbacks.History at 0x7fd3f9b091d0>
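
The parameter counts in the model summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The helper function names are hypothetical, and the vocabulary size of 5808 is not stated in the notebook but is implied by the embedding's 1,486,848 parameters divided by 256.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

vocab_size = 5808  # assumption: inferred from 1,486,848 / 256 in the summary

counts = [
    embedding_params(vocab_size, 256),  # embedding_1
    lstm_params(256, 32),               # lstm_1
    lstm_params(32, 8),                 # lstm_2
    dense_params(8, 1),                 # dense_1
]
print(counts, sum(counts))  # [1486848, 36992, 1312, 9] 1525161
```

Almost all of the roughly 1.5 million parameters sit in the embedding layer, while the training set has only 2583 samples. That is consistent with the log above, where training accuracy climbs quickly while validation loss does not improve.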