## Deep Learning Course (980)
## Assignment Three 

__Assignment Goals:__

- Implementing RNN based language models.
- Implementing and applying a Recurrent Neural Network on text classification problem using TensorFlow.
- Implementing __many to one__ and __many to many__ RNN sequence processing.

In this assignment, you will implement RNN-based language models and compare the word representations extracted from the different models. You will also compare two training methods for sequential data: Truncated Backpropagation Through Time __(TBTT)__ and Backpropagation Through Time __(BTT)__. 
You will also be asked to apply a vanilla RNN to capture word representations and solve a text classification problem. 


__DataSets__: You will use two datasets: an English Literature corpus for the language modeling tasks (parts 1 to 4) and 20Newsgroups for text classification (part 5). 


1. (30 points) Implement the RNN-based language model described by Mikolov et al. [1], also called an __Elman network__, and train a language model on the English Literature dataset. This network contains an input layer, a hidden layer, and an output layer, and is trained by standard backpropagation (TBTT with τ = 1) using the cross-entropy loss. 
   - The input consists of the current word in 1-of-N coding (so its size equals the size of the vocabulary) together with the vector s(t − 1), which holds the output values of the hidden layer from the previous time step. 
   - The hidden layer is a fully connected sigmoid layer with size 500. 
   - The output layer is a softmax layer, so that the outputs form a valid probability distribution.
   - The model is trained with truncated backpropagation through time (TBTT) with τ = 1: the weights of the network are updated based on the error vector computed only for the current time step.
   
   Download the English Literature dataset and train the language model as described, report the model cross-entropy loss on the train set. Use nltk.word_tokenize to tokenize the documents. 
For initialization, s(0) can be set to a vector of small values. Note that we are not interested in the *dynamic model* mentioned in the original paper. 
To make the implementation simpler you can use Keras to define neural net layers, including Keras.Embedding. (Keras.Embedding will create an additional mapping layer compared to the Elman architecture.) 

2. (20 points) TBTT has less computational cost and memory needs in comparison with *backpropagation through time algorithm (BTT)*. These benefits come at the cost of losing long term dependencies [2]. Now let's try to investigate computational costs and performance of learning our language model with BTT. For training the Elman-type RNN with BTT, one option is to perform mini-batch gradient descent with exactly one sentence per mini-batch. (The input  size will be [1, Sentence Length]). 

    1. Split the document into sentences (you can use nltk.tokenize.sent_tokenize).
    2. For each sentence, perform one pass that computes the mean/sum loss for this sentence; then perform a gradient update for the whole sentence. (So the mini-batch size varies for the sentences with different lengths). You can truncate long sentences to fit the data in memory. 
    3. Report the model cross-entropy loss.

3. (15 points) Simple recurrent neural networks do not seem able to truly exploit context information with long dependencies, because gradients vanish or explode. Gating mechanisms for recurrent neural networks were introduced to solve this problem. Train your last model (Elman + BTT) again with the SimpleRNN unit replaced by a Gated Recurrent Unit (GRU). Report the model cross-entropy loss. Compare your results in terms of cross-entropy loss with the two other approaches (parts 1 and 2). Use each model to generate 10 synthetic sentences of 15 words each. Discuss the quality of the generated sentences: do they look like proper English? Do they match the training set?
    Text generation from a given language model can be done using the following iterative process:
   1. Set sequence = \[first_word\], chosen randomly.
   2. Select a new word based on the sequence so far, add this word to the sequence, and repeat. At each iteration, select the word with maximum probability given the sequence so far. The trained language model outputs this probability. 

4. (15 points) The text describes how to extract a word representation from a trained RNN (Chapter 4). How can we evaluate the word representation extracted from your trained RNN? Compare the word representations extracted by each of the approaches using one of the existing methods.

5. (20 points) We are aiming to learn an RNN model that predicts the category of a document given its content (text classification). For this task, we will use the 20Newsgroups dataset, which contains messages from twenty newsgroups. We selected four major categories (comp, politics, rec, and religion) comprising around 13k documents altogether. Your model should learn word representations to support the classification task. To solve this problem, modify the __Elman network__ architecture such that the last layer is a softmax layer with just 4 output neurons (one for each category). 

    1. Download the 20Newsgroups dataset, and use the implemented code from the notebook to read in the dataset.
    2. Split the data into a training set (90 percent) and validation set (10 percent). Train the model on  20Newsgroups.
    3. Report your accuracy results on the validation set.
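The Elman network of part 1 can be sketched in plain Python as follows. This is a minimal illustration rather than a reference solution: the matrix names `U`, `W` and `V`, the toy sizes, and the example sentence are assumptions made for the sketch, and the input weights are split into a word part (`U`) and a recurrent part (`W`) instead of one matrix over the concatenated input.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def elman_step(word_id, s_prev, U, W, V):
    """One time step: s(t) = sigmoid(U w(t) + W s(t-1)), y(t) = softmax(V s(t))."""
    hidden = len(s_prev)
    # With a 1-of-N input, U w(t) is simply column `word_id` of U.
    s = [sigmoid(U[j][word_id] + sum(W[j][k] * s_prev[k] for k in range(hidden)))
         for j in range(hidden)]
    y = softmax([sum(V[i][j] * s[j] for j in range(hidden)) for i in range(len(V))])
    return s, y

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab, hidden = 5, 3                    # toy sizes; the assignment uses hidden = 500
U = random_matrix(hidden, vocab, rng)   # input -> hidden
W = random_matrix(hidden, hidden, rng)  # hidden -> hidden (recurrent)
V = random_matrix(vocab, hidden, rng)   # hidden -> output

s = [0.1] * hidden                      # s(0): a vector of small values
sentence = [0, 3, 1, 4]                 # made-up word ids
total_loss = 0.0
for current, target in zip(sentence[:-1], sentence[1:]):
    s, y = elman_step(current, s, U, W, V)
    total_loss += -math.log(y[target])  # cross-entropy of the true next word
mean_loss = total_loss / (len(sentence) - 1)
print(mean_loss)
```

With untrained weights this small, the output distribution is close to uniform, so the mean loss is near log(5). Training with τ = 1 would update the weights from the error at the current step only, treating s(t − 1) as a constant.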

__NOTE__: Please use Jupyter Notebook. The notebook should include the final code, results and your answers. You should submit your Notebook in (.pdf or .html) and .ipynb format. (penalty 10 points) 

To reduce the parameters, you can merge all words that occur less often than a threshold into a special rare token (\__unk__).
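One possible way to merge rare words is sketched below. The function name `replace_rare_words`, the threshold value, and the example sentence are illustrative assumptions; only the `__unk__` token comes from the text above.

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2, unk_token="__unk__"):
    """Replace every token that occurs fewer than `min_count` times by `unk_token`."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else unk_token for tok in tokens]

tokens = "to be or not to be that is the question".split()
merged = replace_rare_words(tokens, min_count=2)
print(merged)
print(len(set(tokens)), "->", len(set(merged)))  # vocabulary size before and after
```

In this toy sentence only "to" and "be" occur twice, so the vocabulary shrinks from 8 word types to 3 (including the rare token), which reduces the size of both the embedding and the softmax layer.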

__Instructions__:

The university policy on academic dishonesty and plagiarism (cheating) will be taken very seriously in this course. Everything submitted should be your own writing or coding. You must not let other students copy your work. Spelling and grammar count.

Your assignments will be marked based on correctness, originality (the implementations and ideas are your own), clarity, and test performance.


[1] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur: Recurrent neural network based language model. In: Proc. INTERSPEECH 2010.

[2] Tallec, Corentin, and Yann Ollivier. "Unbiasing truncated backpropagation through time." arXiv preprint arXiv:1705.08209 (2017).


In [64]:
from numpy import array
import numpy as np
import random  # used later to pick a random seed word for text generation

from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras.layers import SimpleRNN, TimeDistributed, Flatten
from keras.preprocessing.text import Tokenizer
import keras.backend as K


# source text
data_file = "./datasets/English Literature.txt"

# Read the whole corpus as one string
with open(data_file) as f:
    data = f.read().strip()

# Fit a word-level tokenizer on the corpus (the default filters strip punctuation)
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]


vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 12633


In [43]:
! pip install nltk
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
def generate_seq(model, tokenizer, seed_id, n_words):
    """Sample n_words words after the seed word id and return the text."""
    ids = [seed_id]
    for _ in range(n_words):
        # Predicted distribution over the vocabulary given the sequence so far
        probs = model.predict(np.array(ids).reshape(1, -1))[0].astype('float64')
        probs[0] = 0.0  # index 0 is reserved for padding and maps to no word
        probs /= probs.sum()
        # Sample the next word; output index i is word id i, so no offset is needed
        ids.append(int(np.random.choice(len(probs), p=probs)))
    return " ".join(tokenizer.index_word[i] for i in ids)
 

## Question 1:

The training accuracy is about 13% and the perplexity is 209558.2136.

Perplexity can be defined in different ways; the definition used here is given by the perplexity function below.

Ten generated sentences of 15 words each are shown at the end of this part.
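As far as I can tell, the `perplexity` metric below exponentiates the cross-entropy of each sample and Keras then averages those values, which differs from the more common definition exp(mean cross-entropy). The sketch below contrasts the two on made-up per-word losses; the function names and the numbers are assumptions for illustration only.

```python
import math

def perplexity_from_mean_loss(losses):
    """Common definition: exp of the mean per-word cross-entropy (in nats)."""
    return math.exp(sum(losses) / len(losses))

def mean_of_exponentials(losses):
    """Average of exp(loss) per word, as a per-sample metric would report it."""
    return sum(math.exp(l) for l in losses) / len(losses)

losses = [1.0, 2.0, 9.0]   # made-up per-word cross-entropy values
p = perplexity_from_mean_loss(losses)
m = mean_of_exponentials(losses)
print(p, m)
```

A few poorly predicted words dominate the mean of exponentials, which would explain why the reported perplexity is far larger than exp(loss).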


In [0]:
def perplexity(y_true, y_pred):
    # Per-sample cross-entropy (in nats), exponentiated; Keras averages the values over samples
    cross_entropy = K.categorical_crossentropy(y_true, y_pred)
    return K.exp(cross_entropy)

In [119]:
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

sequences = array(sequences)
X, y = sequences[:,0],sequences[:,1]


y = to_categorical(y, num_classes=vocab_size)
print (X.shape)

Total Sequences: 204088
(204088,)


In [120]:


# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=None))
model.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())


model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', perplexity])

model.fit(X, y, epochs=20, verbose=1)


Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_12 (Embedding)     (None, None, 50)          631650    
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (None, 500)               275500    
_________________________________________________________________
dense_12 (Dense)             (None, 12633)             6329133   
Total params: 7,236,283
Trainable params: 7,236,283
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f26389b5588>

In [0]:
model.save("model.h5")

from google.colab import files
files.download('model.h5')

In [75]:
print("First model:")
for i in range(10):
    print(i + 1)
    # Word ids start at 1 (0 is reserved for padding), so draw the seed from [1, vocab_size)
    print(generate_seq(model, tokenizer, random.randrange(1, vocab_size), 15))

First model:
1


forbidden surgeon can like so thing us i thee i gloucester thou which him rash awaking
2
time's lucio are i fair what baptista of are shall draw and what thee not given
3
invocate will thy art him of and art captain to clear wearers would and unless on
4
bestride and gloucester where do my catesby see our could like security my by so to
5
reproof to them an awhile go then near thou scold to love thou drops in he
6
request mercy and both jest please face kate you do me richard whose legs that going
7
cowardly is actions was first now queen i as have behoveful with it may for him
8
although of and is all no shepherd shall claudio artificial dost go cannot hands of were
9
braves you work my now of be he'll stay'd this souls is sets be gold have
10
employment against and lived a side of burgundy to and go spend with clifford to slaves


## Question 2:

These are the results:

Loss: 4.692541230735192

ACC: 0.15408897705967708

Perplexity: 256945.7147414976

Ten generated sentences of 15 words each are shown at the end of this part.
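The per-sentence training data can be built by shifting each encoded sentence by one position, so that every word is paired with the word that follows it. The sketch below assumes the sentences are already encoded as lists of word ids; the helper name `make_sentence_pairs` and the `max_len` truncation parameter are illustrative.

```python
def make_sentence_pairs(encoded_sentences, max_len=None):
    """Return (input_ids, target_ids) per sentence, where targets are inputs shifted by one."""
    pairs = []
    for ids in encoded_sentences:
        if max_len is not None:
            ids = ids[:max_len]      # truncate long sentences to fit in memory
        if len(ids) < 2:
            continue                 # at least one (input, target) pair is needed
        pairs.append((ids[:-1], ids[1:]))
    return pairs

encoded_sentences = [[4, 7, 2, 9], [5], [3, 8, 1, 6, 2, 7]]  # made-up word ids
pairs = make_sentence_pairs(encoded_sentences, max_len=5)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Each pair then forms one mini-batch, so the number of predictions per gradient update varies with the sentence length.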


In [0]:
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
from nltk.tokenize import sent_tokenize


data_file = "./datasets/English Literature.txt"

with open(data_file) as f:
    data = f.read().strip()

# Split the corpus into sentences; each sentence becomes one mini-batch
sentences = sent_tokenize(data)

# Fit the tokenizer once on all sentences so that word ids stay consistent
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
vocab_size = len(tokenizer.word_index) + 1

X = []  # input word ids per sentence
y = []  # target word ids per sentence (inputs shifted by one position)

for words in tokenizer.texts_to_sequences(sentences):
    if len(words) <= 1:
        continue
    X.append(words[:-1])
    y.append(words[1:])


In [0]:
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

In [46]:
model2 = Sequential()
model2.add(Embedding(vocab_size, 50, input_length=None))
model2.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
model2.add(Dense(vocab_size, activation='softmax'))
print(model2.summary())
model2.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy', perplexity])

for e in range(20):
    print("Epoch:" , (e + 1))
    loss = 0
    acc = 0
    per = 0
    c = 0
    for i in range(len(X)):
        if len(y[i]) < 1:
            continue
        # Shape (sentence_length,): Keras treats every word as its own length-1 sequence
        sequenceX = np.array(X[i])
        # Keep the one-hot targets in their own variable so the list `y` is not overwritten
        targets = to_categorical(y[i], num_classes=vocab_size)
        # One gradient update per sentence: the batch size equals the sentence length
        history = model2.fit(sequenceX, targets, verbose=0, batch_size=len(sequenceX), epochs=1)
        loss += history.history['loss'][0]
        acc += history.history['acc'][0]
        per += history.history['perplexity'][0]
        c += 1
    print("Loss:", {loss / c})
    print("ACC:", {acc / c})
    print("perplexity:", {per / c})

Model: "sequential_22"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_21 (Embedding)     (None, None, 50)          631650    
_________________________________________________________________
simple_rnn_20 (SimpleRNN)    (None, 500)               275500    
_________________________________________________________________
dense_21 (Dense)             (None, 12633)             6329133   
Total params: 7,236,283
Trainable params: 7,236,283
Non-trainable params: 0
_________________________________________________________________
None
Epoch: 1
Loss: {7.221953507156118}
ACC: {0.03287501753406422}
perplexity: {815854.147357508}
Epoch: 2
Loss: {6.604992292591675}
ACC: {0.06046957436715353}
perplexity: {436012.5156722426}
Epoch: 3
Loss: {6.2889178734597975}
ACC: {0.08113147898301172}
perplexity: {361909.70728899864}
Epoch: 4
Loss: {6.0862827845805265}
ACC: {0.09303785805825926}
perplexity: {339181.39

In [0]:
model2.save("model2.h5")

from google.colab import files
files.download('model2.h5')

In [79]:
model2 = Sequential()
model2.add(Embedding(vocab_size, 50, input_length=None))
model2.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
model2.add(Dense(vocab_size, activation='softmax'))
# Restore the weights trained above
model2.load_weights("./model2 (1).h5")

print("Second model:")
for i in range(10):
    # Word ids start at 1 (0 is reserved for padding)
    print(generate_seq(model2, tokenizer, random.randrange(1, vocab_size), 15))

Second model:
whores to were and of and therefore play king thou thanks florizel sit to noble what's
stomachers lodowick to strike york and against my affliction not pompey allowed is that provost an
trebles in belly not well though so or all in he both from he not wilt


paris' were thou sign it on ere be scandal you take a lady they ear me
excuses to and these elizabeth my richmond to it on awhile to apparel to if be
temporizer in faced to that stumble to your faith to your from henry farewell from menenius
shoe changed read he what ha gave it horse i late of huntsman be fortunes hit
nicety to hither did forward brutus go i keep if of his hence to sir cannot
disorder seat my done this empty elizabeth were to love my state they us king you
kernel he in here came duke made to if him love thou dark nursed reprieves with


## Question 3:

These are the results:

Loss: 4.34799251055036

ACC: 0.16645361445349946

Perplexity: 121986.61053915822

The loss of this GRU model is lower than the losses of the models used in Q1 and Q2.

The generated sentences for all three models are shown at the end of this part.

The quality of the sentences from the GRU model is better, as expected.

In general, the sentences look a little like English: neighbouring words often fit together, but the sentences as a whole cannot be counted as meaningful. Better sentences would require more training epochs, which I could not run because of limited resources.
They are, however, somewhat similar to the training set.
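The assignment describes picking the most probable next word at each step, whereas `generate_seq` in this notebook samples from the predicted distribution. The sketch below shows both strategies on a toy model: the `BIGRAM_PROBS` table and the helper names are made up for illustration and stand in for the softmax output of a trained network.

```python
import random

# Toy next-word distributions, standing in for a trained model's softmax output
BIGRAM_PROBS = {
    "the": {"king": 0.6, "queen": 0.4},
    "king": {"is": 0.7, "the": 0.3},
    "queen": {"is": 0.9, "the": 0.1},
    "is": {"the": 1.0},
}

def next_word_probs(sequence):
    """Hypothetical model call: distribution of the next word given the sequence."""
    return BIGRAM_PROBS[sequence[-1]]

def generate(first_word, n_words, greedy=True, rng=None):
    rng = rng or random.Random(0)
    sequence = [first_word]
    for _ in range(n_words):
        probs = next_word_probs(sequence)
        words = list(probs)
        if greedy:
            word = max(words, key=probs.get)   # most probable word
        else:
            word = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
        sequence.append(word)
    return " ".join(sequence)

print(generate("the", 5, greedy=True))
print(generate("the", 5, greedy=False))
```

Greedy decoding is deterministic and tends to fall into loops ("the king is the king is"), while sampling gives more varied but noisier text, which is one plausible reason to prefer it for this task.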

In [67]:
from keras.callbacks import ModelCheckpoint
import tensorflow as tf
from nltk.tokenize import sent_tokenize
from itertools import chain 
from keras.layers import GRU


model3 = Sequential()
model3.add(Embedding(vocab_size, 50, input_length=None))
model3.add(GRU(500, return_sequences=False, activation='sigmoid'))
model3.add(Dense(vocab_size, activation='softmax'))
print(model3.summary())
model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy', perplexity])
for e in range(20):
    print("Epoch:" , (e + 1))
    loss = 0
    acc = 0
    per = 0
    c = 0
    for i in range(len(X)):
        if len(y[i]) < 1:
            continue
        # Shape (sentence_length,): Keras treats every word as its own length-1 sequence
        sequenceX = np.array(X[i])
        # Keep the one-hot targets in their own variable so the list `y` is not overwritten
        targets = to_categorical(y[i], num_classes=vocab_size)
        # One gradient update per sentence: the batch size equals the sentence length
        history = model3.fit(sequenceX, targets, verbose=0, batch_size=len(sequenceX), epochs=1)
        loss += history.history['loss'][0]
        acc += history.history['acc'][0]
        per += history.history['perplexity'][0]
        c += 1
    print("Loss:", {loss / c})
    print("ACC:", {acc / c})
    print("perplexity:", {per / c})

Model: "sequential_20"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_20 (Embedding)     (None, None, 50)          631650    
_________________________________________________________________
gru_2 (GRU)                  (None, 500)               826500    
_________________________________________________________________
dense_19 (Dense)             (None, 12633)             6329133   
Total params: 7,787,283
Trainable params: 7,787,283
Non-trainable params: 0
_________________________________________________________________
None
Epoch: 1
Loss: {7.015517119945779}
ACC: {0.034783006615057725}
perplexity: {447143.91879701643}
Epoch: 2
Loss: {6.416574023560013}
ACC: {0.06383896742052501}
perplexity: {196616.564725189}
Epoch: 3
Loss: {6.0673916762581666}
ACC: {0.08300309828349565}
perplexity: {167324.09997899493}
Epoch: 4
Loss: {5.8144855614327176}
ACC: {0.09624678261184515}
perplexity: {157557.

In [0]:
model3.save("model33.h5")

from google.colab import files
files.download('model33.h5')

In [72]:

print("Third model:")
for i in range(10):
    print(i + 1)
    # Word ids start at 1 (0 is reserved for padding)
    print(generate_seq(model3, tokenizer, random.randrange(1, vocab_size), 15))


Third model:
1
market knave i preys lasted callest standard ethiopian's grissel unsway'd altitude pinch captum opportunity spicery minimo
2


sky and reprobate screen happened knowist baptized blades adopts waned admit gardener stabbing unbraided feathers crook'd
3
innocent power of distinctly mustering slop stew cement effects hoarded pother pinion'd vulgarly suffolk walled shaken
4
royalties have unvalued forgets feather'd unpossible covenants crassus dishonourable sayings hawthorn i'm est creating disobedient tumble
5
nobility to i as lucky woodstock's connive disclaiming misleading make's predominant predominant subjected emmew 'point hoarded
6
dulcet to to tuns burthenous businesses nursing separation wing'd burthenous poland believing nineteen banners courtiers' chapless
7
carpets worship distant fair'st crickets uplifted nursing chivalrous inclinest machiavel court'sy pinion'd unmusical confess'd callest grandchild
8
glut so fustian filth 'priami separation cropp'd dresser wards confusions consolation quoifs properties reform'd ebb'd laudis
9
horseman which which prizes connive burthenous poesy emboss'd tenement allow

In [82]:
# Generate 10 sentences of 15 words with each of the three models.
# Word ids start at 1 (0 is reserved for padding), so seeds are drawn from [1, vocab_size).
print("First model:")
for i in range(10):
    print(i + 1)
    print(generate_seq(model, tokenizer, random.randrange(1, vocab_size), 15))

print("Second model:")
for i in range(10):
    print(i + 1)
    print(generate_seq(model2, tokenizer, random.randrange(1, vocab_size), 15))

print("Third model:")
for i in range(10):
    print(i + 1)
    print(generate_seq(model3, tokenizer, random.randrange(1, vocab_size), 15))

First model:
1
hoop i her again brother should courage my red worth to his tear to he sire
2
rod i at his we grow that nay no devil on he her which us me
3


tractable i warwick take it not dove thou enough to clap brutus shall than him o
4
vilely pawn is may thee jewel you framed o beat my nor question to if my
5
traditional limbs strange friends and sea excels king o or and let make better we both
6
unwittingly which richard whose doxy warrant time you body i they time i by thinks in
7
abominable him draw are stands with so thee provost where my norfolk queen camillo gone of
8
snakes is henry farewell live should term duke is witch tunis o all here this is
9
fume you me to whom him i loving now whom him clarence all a mind thy
10
bail york know it at that than bear myself brutus eye ill what me of report
Second model:
1
herein wipe it they was this go name not edward that true to holy strike place
2
volscians men king what present arms tent on if have own wife them blood be night
3
pitied'st bitter were katharina all heaven and a instruction my nor your abroad me warwick world
4
incapable you gracious now a brow thou prince heard and of a

## Question 4:

For this part we follow these steps:

1. Load the models of Q1 and Q2.

2. Replace the weights of the first layer (the embedding) of the Q1 model with the first-layer weights of the Q2 model.

3. Run the evaluate function on the new model.

4. Repeat these steps for every possible pair of models.

### Result:

The losses are:

Model Q1 with the first layer of model Q1: loss = 4.6

Model Q1 with the first layer of model Q2: loss = 9.7

Model Q1 with the first layer of model Q3: loss = 8.3

Model Q2 with the first layer of model Q2: loss = 5.7

Model Q2 with the first layer of model Q1: loss = 12.4

Model Q2 with the first layer of model Q3: loss = 8.4

Model Q3 with the first layer of model Q3: loss = 5.7

Model Q3 with the first layer of model Q2: loss = 11.6

Model Q3 with the first layer of model Q1: loss = 13.1

When a first layer is transplanted into another model, the first layer of the Q3 model gives the lowest loss (8.3 in the Q1 model and 8.4 in the Q2 model, against 9.7 and 12.4 for the alternatives), and the Q3 model with its own first layer reaches 5.7. Therefore, the word representation of the Q3 model is the better one.
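The weight-swapping comparison above measures how well one model's embedding fits another model's recurrent layers. A complementary check, which is my own suggestion and not taken from the assignment text, is to inspect the nearest neighbours of a word by cosine similarity. The vectors below are made up; with a trained model, the rows of `model.layers[0].get_weights()[0]` indexed by the tokenizer's word ids could be used instead.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbours(word, embeddings, k=2):
    """Return the k words whose vectors are most similar to the vector of `word`."""
    query = embeddings[word]
    scored = [(cosine_similarity(query, vec), other)
              for other, vec in embeddings.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

# Made-up 3-dimensional embeddings for illustration
embeddings = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "duke":  [0.7, 0.0, 0.3],
    "sword": [0.0, 0.9, 0.4],
}
print(nearest_neighbours("king", embeddings))
```

If the neighbours of frequent words are semantically or syntactically related in one model but not in another, that is qualitative evidence for the better representation.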



In [123]:
# Each language model together with the file that holds its trained weights
weight_files = {"Q1": (model, "model.h5"),
                "Q2": (model2, "model2 (1).h5"),
                "Q3": (model3, "model33.h5")}

def evaluate_with_embedding(target, source):
    """Evaluate model `target` after copying in the first layer of model `source`."""
    # Reload both models so that earlier swaps do not leak into this run
    for name in (target, source):
        net, path = weight_files[name]
        net.load_weights(path)
    target_model = weight_files[target][0]
    source_model = weight_files[source][0]
    target_model.layers[0].set_weights(source_model.layers[0].get_weights())
    target_model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=['accuracy', perplexity])
    results = target_model.evaluate(X, y)
    print('test loss, test acc, test per:', results)

# Same order as the printed results: each model with its own first layer comes first
for target, source in [("Q1", "Q1"), ("Q1", "Q2"), ("Q1", "Q3"),
                       ("Q2", "Q2"), ("Q2", "Q1"), ("Q2", "Q3"),
                       ("Q3", "Q3"), ("Q3", "Q2"), ("Q3", "Q1")]:
    evaluate_with_embedding(target, source)



test loss, test acc, test per: [4.633732320971615, 0.1484457684923151, 202750.71140057384]
test loss, test acc, test per: [9.747747256219292, 0.013655873936733174, 2506176.3897901885]
test loss, test acc, test per: [8.334856232613259, 0.014601544431813728, 979268.2633886791]
test loss, test acc, test per: [5.767777129532865, 0.12091352749808809, 527884.9857842898]
test loss, test acc, test per: [12.45603134204918, 0.002293128454392223, 5685731.083797186]
test loss, test acc, test per: [8.408440555101336, 0.003767982438947905, 815350.6851895881]
test loss, test acc, test per: [5.7399806672021825, 0.11605777899787939, 420310.28253908566]
test loss, test acc, test per: [11.606718217147607, 0.014733840304182509, 4579286.980028223]
test loss, test acc, test per: [13.14634169536347, 0.009495903727803692, 6585794.33781506]


## Question 5:

The accuracy on the validation set is about 60%.

Overfitting appears to set in after roughly epoch 8 or 9.
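One common way to deal with overfitting of this kind is early stopping: keep the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The sketch below shows the logic on made-up validation losses; the helper name and the `patience` value are illustrative. Keras provides an `EarlyStopping` callback built on the same idea.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) using 1-based epoch numbers."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch   # no improvement for `patience` epochs
    return best_epoch, len(val_losses)

val_losses = [1.30, 1.10, 0.98, 0.95, 0.97, 1.02, 1.10]  # made-up values
best, stopped = best_epoch_with_patience(val_losses, patience=2)
print("best epoch:", best, "stopped at epoch:", stopped)
```

Here the validation loss is lowest at epoch 4 and training would stop at epoch 6, so the weights from epoch 4 would be the ones to keep.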


In [46]:
import tarfile
# Use a name other than `tf` so that the TensorFlow alias is not shadowed
archive = tarfile.open("20Newsgroups_subsampled.tar")
archive.extractall()
archive.close()

"""This code is used to read all news and their labels"""
import os
import glob

def to_categories(name, cat=["politics","rec","comp","religion"]):
    # Map a newsgroup folder name to the index of the major category it contains
    for i in range(len(cat)):
        if cat[i] in name:
            return i
    # A folder outside the four expected categories cannot be labelled
    raise ValueError("Unexpected folder: " + name)

def data_loader(data_path):
    categories = os.listdir(data_path)
    news = [] # news content
    groups = [] # category which it belongs to

    for cat in categories:
        print("Category:"+cat)
        for the_new_path in glob.glob(data_path + '/' + cat + '/*'):
            with open(the_new_path, encoding="ISO-8859-1", mode='r') as f:
                news.append(f.read())
            groups.append(cat)

    return news, list(map(to_categories, groups))



data_path = "20news_subsampled"
news, groups = data_loader(data_path)

Category:rec.sport.hockey
Category:talk.politics.mideast
Category:comp.windows.x
Category:comp.os.ms-windows.misc
Category:comp.sys.ibm.pc.hardware
Category:comp.graphics
Category:soc.religion.christian
Category:talk.politics.misc
Category:rec.autos
Category:rec.sport.baseball
Category:rec.motorcycles
Category:comp.sys.mac.hardware
Category:talk.religion.misc
Category:talk.politics.guns


In [0]:
from keras.preprocessing.text import Tokenizer
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Fit one tokenizer on the whole corpus so that word ids are shared across documents
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')
tokenizer.fit_on_texts(news)

# Encode every document as a sequence of word ids
X = tokenizer.texts_to_sequences(news)
vocab_size = len(tokenizer.index_word) + 1

# Pad or truncate every document to 750 tokens
X = tf.keras.preprocessing.sequence.pad_sequences(X, maxlen=750)

# 90 percent for training, 10 percent for validation
X_train, X_test, Y_train, Y_test = train_test_split(X, groups, train_size=0.9, test_size=0.1)
X_train = np.array(X_train)
X_test = np.array(X_test)
Y_train = to_categorical(np.array(Y_train), num_classes=4)
Y_test = to_categorical(np.array(Y_test), num_classes=4)


In [62]:
model5 = Sequential()
model5.add(Embedding(vocab_size, 50))
model5.add(SimpleRNN(500, return_sequences=False, activation='sigmoid'))
model5.add(Dense(4, activation='softmax'))
print(model5.summary())


model5.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model5.fit(X_train, Y_train, epochs=15, batch_size = 64, verbose=1, validation_data=(X_test, Y_test))

Model: "sequential_18"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_18 (Embedding)     (None, None, 50)          7399950   
_________________________________________________________________
simple_rnn_17 (SimpleRNN)    (None, 500)               275500    
_________________________________________________________________
dense_17 (Dense)             (None, 4)                 2004      
Total params: 7,677,454
Trainable params: 7,677,454
Non-trainable params: 0
_________________________________________________________________
None
Train on 11797 samples, validate on 1311 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7ff3464f7358>