## Further ideas
If no embedding matrix is passed to the `weights` argument of the `Embedding` layer and the `trainable` argument is set to `True`, the weights of the embedding layer are learned together with the rest of the network. In this case no pre-trained word embedding is required. It would be interesting to compare this approach with the one that starts from pre-trained weights.
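
A minimal sketch of how the two variants could be set up for such a comparison, assuming the Keras 2 `Embedding` signature used in this notebook. The helper name `embedding_kwargs` is hypothetical; it only assembles the keyword arguments and adds `weights` when a pre-trained matrix is given.

```python
def embedding_kwargs(num_words, embedding_dim, input_length, embedding_matrix=None):
    """Keyword arguments for keras.layers.Embedding (hypothetical helper).

    Without embedding_matrix the layer starts from random weights and learns
    them during training; with it, training starts from the pre-trained vectors.
    """
    kwargs = {
        "input_dim": num_words,
        "output_dim": embedding_dim,
        "input_length": input_length,
        "trainable": True,
    }
    if embedding_matrix is not None:
        # Keras 2 expects the initial weights as a list with one matrix
        kwargs["weights"] = [embedding_matrix]
    return kwargs


# Variant without pre-trained vectors (sizes here are illustrative only)
from_scratch = embedding_kwargs(num_words=1000, embedding_dim=200, input_length=35)
# In the notebook one would then build the layer with Embedding(**from_scratch)
# and, for the pre-trained variant, pass embedding_matrix=EMBEDDING_MATRIX.
```

Keeping everything else identical (LSTM size, dropout, epochs, train/test split) would make the two test accuracies directly comparable.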

As an alternative to the CNN, an LSTM architecture can be applied for classification, e.g. a single LSTM layer followed by Dropout and a Dense layer at the output. The input embedding layer would be the same as for the CNN.

| Team member | Name                 |
|-------------|----------------------|
| 1.          | Christopher Caldwell |
| 2.          | Fabian Müller        |
| 3.          | An Dang              |

In [1]:
from gensim.models import word2vec
parentDir="./extracted"
modelName="./DataSets/dewiki2018skipgram.model"
model=word2vec.Word2Vec.load(modelName)
print("Model is loaded")




Model is loaded


In [2]:
techpath="./Data/GERMAN/TECH/RSS/FeedText"
generalpath="./Data/GERMAN/GENERAL/RSS/FeedText"
catpaths=[techpath,generalpath]

In [3]:
import os
import sys
# second, prepare text samples and their labels
print('Processing text dataset')
texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for catlabel,categorypath in enumerate(catpaths):
    for name in sorted(os.listdir(categorypath)):
        feedpath = os.path.join(categorypath, name)
        if os.path.isdir(feedpath):
            #print(name)
            for fname in sorted(os.listdir(feedpath)):
                fpath = os.path.join(feedpath, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='utf-8')
                t = f.read()
                news=t.split('\n \n')
                for entry in news:
                    if (len(entry)>50) and (entry not in texts): #remove duplicates
                        #if entry not in texts:
                        texts.append(entry)
                        labels.append(catlabel)
                f.close()
print('Found %s texts.' % len(texts))

Processing text dataset
Found 4012 texts.


### Convert texts to sequence of integers

#### Task 10: Transform each text into a sequence of integers
Apply the [Keras Tokenizer class](https://keras.io/preprocessing/text/#tokenizer) to map each word to a unique integer index and to represent each text (news item) as a sequence of integers. The maximum number of words considered by the Tokenizer shall be `MAX_NB_WORDS=10000`. After fitting the `Tokenizer` object on the available texts (`fit_on_texts()`), its attribute `word_index` maps each word to an integer index.
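
To illustrate the mapping, here is a simplified pure-Python sketch of the idea. It is not the Keras implementation (Keras additionally strips punctuation), and the function names and toy texts are made up for this example. Indices start at 1 for the most frequent word, index 0 stays reserved for padding, and only words with an index below `num_words` are kept.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to its frequency rank: 1 = most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index and drop words with index >= num_words."""
    sequences = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            seq = [i for i in seq if i < num_words]
        sequences.append(seq)
    return sequences


toy_texts = ["der Hund jagt die Katze", "die Katze ruht", "die Maus"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

Note that the limit is applied when converting texts to sequences; `word_index` itself still contains every word seen during fitting.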

In [4]:
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [5]:
MAX_NB_WORDS=10000
# create tokenizer
t = Tokenizer(num_words=MAX_NB_WORDS)
# fit the tokenizer on the docs
t.fit_on_texts(texts)
word_index = t.word_index

In [6]:
from keras.preprocessing.sequence import pad_sequences
import numpy as np 
from keras.utils import to_categorical

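# convert each text into a sequence of word indices
text_sequences = t.texts_to_sequences(texts)

# pad or truncate every sequence to a fixed length of 35 tokens
doc_sequences = pad_sequences(text_sequences, maxlen=35)

# one-hot encoded labels (shape: n_texts x 2); not used below, because the
# models end in a single sigmoid unit and are trained on the integer labels
categorical_label = to_categorical(labels, num_classes=2, dtype='float32')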
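
What `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`) can be sketched in plain Python. The function name `pad_or_truncate` is hypothetical and the sketch assumes `maxlen > 0`; it is meant as an illustration, not as the Keras implementation.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(sequence) >= maxlen:
        # truncating='pre': the beginning of the sequence is cut off
        return list(sequence[-maxlen:])
    # padding='pre': zeros are inserted in front of the sequence
    return [value] * (maxlen - len(sequence)) + list(sequence)


short_seq = pad_or_truncate([5, 8, 2], maxlen=5)
long_seq = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_seq)
print(long_seq)
```

With `maxlen=35`, a news item longer than 35 tokens therefore keeps only its last 35 tokens unless `truncating='post'` is passed.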

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_sequences,labels, test_size=0.2, random_state=123)

### Load  and prepare Word-Embedding

#### Task 14: Load Word Embedding
Load the word embedding that was trained and saved in Task 3.

In [8]:
from gensim.models import word2vec
modelName="./DataSets/dewiki2018skipgram.model"
w2v_model=word2vec.Word2Vec.load(modelName)

#### Task 15: Prepare `embeddings_index`
Next, a Python dictionary `embeddings_index`, which maps each word to its vector representation, must be generated.

In [9]:
# index2word / syn0 are the gensim 3.x names (index_to_key / vectors in gensim 4)
embedding_index = dict(zip(w2v_model.wv.index2word, w2v_model.wv.syn0))


In [10]:
EMBEDDING_DIM = 200

In [11]:
import numpy as np
embedding_matrix = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
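
The same loop can be tried on a toy vocabulary to see what happens to words without a pre-trained vector. All names and values below are made up for illustration; the final line adds a simple coverage check, which is an addition of this sketch and not part of the original task.

```python
import numpy as np

# tokenizer-style index: starts at 1, row 0 is reserved for padding
toy_word_index = {"die": 1, "katze": 2, "xyzzy": 3}
# stand-in for the Word2Vec lookup; "xyzzy" has no pre-trained vector
toy_embedding_index = {
    "die": np.array([0.1, 0.2, 0.3]),
    "katze": np.array([0.4, 0.5, 0.6]),
}
toy_dim = 3

toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, i in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:
        toy_matrix[i] = vector  # unknown words keep their all-zero row

# fraction of vocabulary rows that received a pre-trained vector
coverage = np.any(toy_matrix[1:] != 0, axis=1).mean()
print(toy_matrix)
print("coverage: %.2f" % coverage)
```

Running the same coverage check on the real `embedding_matrix` would show how much of the news vocabulary is actually covered by the Wikipedia-trained embedding.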

In [12]:
from keras.models import Sequential
from keras.layers import Embedding, Dense, Activation, Dropout, LSTM

In [13]:
num_words = len(word_index)+1
EMBEDDING_MATRIX = embedding_matrix
MAX_SEQUENCE_LENGTH = 35

In [18]:
model = Sequential()
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[EMBEDDING_MATRIX], input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 35, 200)           4676800   
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dropout_3 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 101       
Total params: 4,797,301
Trainable params: 4,797,301
Non-trainable params: 0
_________________________________________________________________
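
The parameter counts in this summary can be reproduced by hand, which also reveals the number of rows in the embedding matrix. A short check, assuming the standard LSTM layout of four gates with input weights, recurrent weights and a bias each (the function names are illustrative):

```python
def embedding_params(vocab_rows, embedding_dim):
    # one vector of length embedding_dim per vocabulary row
    return vocab_rows * embedding_dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units


dim = 200
vocab_rows = 4676800 // dim  # rows implied by the embedding parameter count
total = embedding_params(vocab_rows, dim) + lstm_params(dim, 100) + dense_params(100, 1)
print(vocab_rows, lstm_params(dim, 100), dense_params(100, 1), total)
```

The embedding layer accounts for almost all parameters, and since all of them are listed as trainable, the pre-trained vectors are fine-tuned during training rather than frozen.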


In [19]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [20]:
model.fit(X_train, y_train, epochs=10, batch_size=100)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f26d5dfd710>

In [21]:
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 90.91%


In [23]:
model = Sequential()
model.add(Embedding(num_words, EMBEDDING_DIM, weights=[EMBEDDING_MATRIX], input_length=MAX_SEQUENCE_LENGTH))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))

model.summary()

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=10, batch_size=100)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 35, 200)           4676800   
_________________________________________________________________
dropout_6 (Dropout)          (None, 35, 200)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 100)               120400    
_________________________________________________________________
dropout_7 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 4,797,301
Trainable params: 4,797,301
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 91.66%


## Conclusion

As you can see, with a single dropout layer after the LSTM the network trains very well and reaches a test accuracy of 90.91%. With two dropout layers, one before and one after the LSTM, training performance is not as good as for the first model, but the test accuracy is higher at 91.66%. In my opinion the second model performs better on the test data because it overfits the training data less.
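
To put the two accuracies into perspective, the following rough calculation converts them into numbers of test texts. It assumes that `train_test_split` rounds the test size up (0.2 of 4012 texts) and uses a simple binomial standard error as an approximation; both are assumptions of this sketch.

```python
import math

n_texts = 4012
n_test = math.ceil(0.2 * n_texts)  # assumed size of the test split

acc_one_dropout = 0.9091
acc_two_dropout = 0.9166

correct_one = round(acc_one_dropout * n_test)
correct_two = round(acc_two_dropout * n_test)
difference = correct_two - correct_one

# approximate standard error of an accuracy estimate on n_test samples
std_error = math.sqrt(acc_one_dropout * (1 - acc_one_dropout) / n_test)

print(n_test, correct_one, correct_two, difference)
print("standard error: %.4f" % std_error)
```

Under these assumptions the gap corresponds to a handful of test texts and is smaller than one standard error, so repeating the training with several random seeds would help to confirm the conclusion.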


Source:
https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/