# Text Analysis

In [1]:
import os
import numpy as np
from keras.layers import Activation, Conv1D, Dense, Embedding, Flatten, Input, MaxPooling1D
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import get_data_home  # sklearn.datasets.base is a private, removed path
from keras.metrics import categorical_accuracy

data_home = get_data_home()
twenty_home = os.path.join(data_home, "20news_home")

if not os.path.exists(data_home):
    os.makedirs(data_home)
    
if not os.path.exists(twenty_home):
    os.makedirs(twenty_home)
    
!cp ../input/20-newsgroup-sklearn/20news-bydate_py3* /tmp/scikit_learn_data

Using TensorFlow backend.


## Preprocessing the data

As you already learned, we have to tokenize the text before we can feed it into a neural network. Tokenization also discards some features of the original text, such as punctuation, capitalization, and less common words.

In [2]:
# http://qwone.com/~jason/20Newsgroups/
dataset = fetch_20newsgroups(subset='all', shuffle=True, download_if_missing=False)

texts = dataset.data # Extract text
target = dataset.target # Extract target

In [3]:
print(target[:10])

print(len(texts))
print(len(target))
print(len(texts[0].split()))
print(texts[0])
print(target[0])
print(dataset.target_names[target[0]])

[10  3 17  3  4 12  4 10 10 19]
18846
18846
157
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!


10
rec.sport.hockey


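Remember that we have to specify the size of our vocabulary; less frequent words get removed. In this case we want to retain roughly the 20,000 most common words. With num_words=20000, the Keras Tokenizer keeps only word indices below 20,000, that is, the 19,999 most common words, because index 0 is reserved and never assigned to a word.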

In [4]:
vocab_size = 20000

tokenizer = Tokenizer(num_words=vocab_size) # Setup tokenizer
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts) # Generate sequences

In [5]:
print(tokenizer.texts_to_sequences(['Hello King, how are you?']))

print(len(sequences))
print(len(sequences[0]))
print(sequences[0])


[[1595, 1168, 82, 20, 13]]
18846
160
[14, 19415, 455, 559, 15, 29, 2552, 1240, 5609, 33, 322, 767, 2175, 2121, 871, 1343, 32, 251, 88, 77, 84, 12087, 455, 559, 15, 7, 122, 228, 63, 3, 2552, 1240, 20, 517, 3490, 50, 1, 1393, 3, 61, 437, 3, 1507, 50, 1, 1302, 2552, 3027, 3, 1, 2701, 309, 7, 122, 243, 16334, 175, 5, 4, 243, 19416, 268, 7, 122, 194, 2, 296, 37, 337, 2, 369, 4389, 22, 4, 243, 3, 7286, 12, 1, 2552, 349, 30, 20, 1502, 137, 2701, 1382, 90, 7, 397, 5987, 74, 2025, 13, 130, 56, 8, 140, 215, 90, 93, 1457, 770, 1963, 56, 8, 97, 4, 308, 9186, 1857, 2, 1306, 6, 1, 2327, 6760, 115, 348, 5987, 21, 4, 308, 3, 1857, 6, 1, 365, 658, 3, 467, 185, 1, 2552, 20, 194, 2, 1985, 1, 66, 3, 3215, 608, 7, 26, 132, 8755, 19, 2, 131, 1, 3280, 2000, 1, 1151, 1457, 770, 283, 2552, 1222]


In [6]:
word_index = tokenizer.word_index
print('Found {:,} unique words.'.format(len(word_index)))

Found 179,209 unique words.
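
To make the vocabulary cut-off concrete, here is a minimal pure-Python sketch of frequency-ranked tokenization. It is a simplification and not the Keras implementation: the helper names `build_word_index` and `to_sequence` and the toy texts are made up for illustration, and the punctuation handling assumes `string.punctuation` rather than the exact Keras filter list.

```python
from collections import Counter
import string

# Replace every punctuation character with a space (a simplified filter).
PUNCT_TABLE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def split_words(text):
    """Lowercase, strip punctuation, split on whitespace."""
    return text.lower().translate(PUNCT_TABLE).split()

def build_word_index(texts):
    """Rank words by frequency; the most common word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [word_index[w] for w in split_words(text)
            if w in word_index and word_index[w] < num_words]

toy_texts = ["the cat sat", "the cat ran", "the dog"]
toy_index = build_word_index(toy_texts)
toy_sequence = to_sequence("The dog, the cat!", toy_index, num_words=3)

print(toy_index)     # {'the': 1, 'cat': 2, 'sat': 3, 'ran': 4, 'dog': 5}
print(toy_sequence)  # [1, 1, 2] because 'dog' has index 5 and is dropped
```

The full word index still contains every word it has seen, which is why the notebook reports 179,209 unique words even though only the most frequent ones survive in the sequences.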


Our text is now converted to sequences of numbers. It makes sense to convert some of those sequences back into text to check what the tokenization did to our text. To this end we create an inverse index that maps numbers to words while the tokenizer maps words to numbers.

In [7]:
# Create inverse index mapping numbers to words
inv_index = {v: k for k, v in tokenizer.word_index.items()}

# Print out text again
for w in sequences[0]:
    x = inv_index.get(w)
    print(x,end = ' ')

from ratnam andrew cmu edu subject pens fans reactions organization post office carnegie mellon pittsburgh pa lines 12 nntp posting host po4 andrew cmu edu i am sure some of pens fans are pretty confused about the lack of any kind of posts about the recent pens massacre of the devils actually i am bit puzzled too and a bit relieved however i am going to put an end to non relief with a bit of praise for the pens man they are killing those devils worse than i thought jagr just showed you why he is much better than his regular season stats he is also a lot fo fun to watch in the playoffs bowman should let jagr have a lot of fun in the next couple of games since the pens are going to beat the out of jersey anyway i was very disappointed not to see the islanders lose the final regular season game pens rule 

### Measuring text length

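The network expects inputs of a fixed size, so all sequences need to end up with the same length. Let's first look at how long the texts actually are.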

In [8]:
# Get the average length of a text
avg = sum(map(len, sequences)) / len(sequences)

# Get the standard deviation of the sequence length
std = np.sqrt(sum(map(lambda x: (len(x) - avg)**2, sequences)) / len(sequences))

avg,std

(292.4769712405816, 666.9329063050876)

As you can see, the average text is about 300 words long. However, the standard deviation is large (about 670), which indicates that some texts are much longer than average. If a user decided to post an epic novel to the newsgroup, it would massively slow down training. For speed, we therefore restrict the sequence length to 100 words. You should try out different sequence lengths and compare processing time against accuracy gains.
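
When experimenting with sequence lengths, it can help to check what share of the texts a given cutoff would shorten. The sketch below is one possible way to do that; `fraction_truncated` and the toy lengths are hypothetical, and in the notebook the list would come from `[len(s) for s in sequences]`.

```python
def fraction_truncated(lengths, max_length):
    """Share of sequences that are longer than max_length tokens."""
    return sum(1 for n in lengths if n > max_length) / len(lengths)

# Hypothetical sequence lengths, chosen only to illustrate a long tail.
toy_lengths = [40, 80, 120, 160, 300, 5000]

for candidate in (100, 200, 500):
    share = fraction_truncated(toy_lengths, candidate)
    print(candidate, round(share, 2))
```

A cutoff that truncates most texts saves time but discards content, so this number is a useful companion to the validation accuracy.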

In [9]:
print(pad_sequences([[1,2,3]], maxlen=5))
print(pad_sequences([[1,2,3,4,5,6]], maxlen=5))

[[0 0 1 2 3]]
[[2 3 4 5 6]]
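
The two outputs above show the default behaviour of `pad_sequences`: short sequences are padded with zeros on the left, and long sequences lose their first tokens. A minimal sketch of that logic in plain Python, assuming `maxlen >= 1` (the helper `pad_left` is hypothetical, not part of Keras):

```python
def pad_left(sequence, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

print(pad_left([1, 2, 3], 5))           # [0, 0, 1, 2, 3]
print(pad_left([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Keras can also pad or truncate at the end via the `padding='post'` and `truncating='post'` arguments, which may matter if the start of a post (for example the header) carries most of the signal.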


In [10]:
max_length = 100
data = pad_sequences(sequences, maxlen=max_length)

## Turning labels into one-hot encodings

Labels can quickly be encoded into one-hot vectors with Keras:

In [11]:
from keras.utils import to_categorical
labels = to_categorical(np.asarray(target))
print('Shape of data:', data.shape)
print('Shape of labels:', labels.shape)

print (target[0])
print (labels[0])

Shape of data: (18846, 100)
Shape of labels: (18846, 20)
10
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
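
For intuition, one-hot encoding can be sketched in a few lines of plain Python. This is only an illustration of what `to_categorical` produces, with hypothetical helper names `one_hot` and `decode`:

```python
def one_hot(class_ids, num_classes):
    """Turn integer class ids into rows with a single 1.0 at the class position."""
    rows = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        rows.append(row)
    return rows

def decode(row):
    """Recover the class id as the position of the largest entry."""
    return row.index(max(row))

encoded = one_hot([10, 3], num_classes=20)
print(encoded[0])          # 1.0 at position 10, zeros elsewhere
print(decode(encoded[1]))  # 3
```

The `decode` step is the same idea as the `np.argmax` call used later to read off the predicted category.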


## Loading GloVe embeddings


In [12]:
glove_dir = '../input/glove-global-vectors-for-word-representation' # This is the folder with the dataset

embeddings_index = {} # We create a dictionary of word -> embedding

with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0] # The first value is the word, the rest are the values of the embedding
        embedding = np.asarray(values[1:], dtype='float32') # Load embedding
        embeddings_index[word] = embedding # Add embedding to our embedding dictionary

print('Found {:,} word vectors in GloVe.'.format(len(embeddings_index)))

Found 400,000 word vectors in GloVe.


In [13]:
print(embeddings_index['frog'])
print(len(embeddings_index['frog']))

[ 0.043084   0.53233    0.54254   -0.076952  -0.29673    0.52986
  0.21379    0.15789   -0.3952    -0.91889   -0.6585     0.68706
  0.10821   -0.10694   -0.3401     1.044      0.12775    0.51157
  0.60314    0.71366   -0.5374     0.37737    0.12186    0.60891
  0.50107    2.0215    -0.47318    0.46953    0.12542    0.60207
  0.11007    0.37587    1.0137    -0.2478     0.65748    0.12801
 -0.57647   -0.25754    0.62426    0.010864  -0.40681    0.16173
 -0.84695   -0.24603    0.29078    0.8546    -0.067021   0.69331
 -0.71545   -0.25184   -0.74741   -0.26507    0.4873     0.41991
 -0.86741   -0.5235    -0.44774   -0.044584   0.033836   0.29909
  0.73754    0.81651    0.69431    0.80453    0.29276   -0.025244
 -0.30453   -0.34329    0.11933   -0.29655    0.1072    -0.18946
  0.18501   -0.7548    -0.25628    0.34438   -0.016743   0.0040503
  0.39342    0.99404   -0.32159   -0.49434    0.41708   -0.011019
 -0.16613   -0.20839    0.28152   -0.82996    0.79839    0.61645
  0.31537   -0.27629 

In [14]:
# Euclidean distance between word vectors: a smaller value means more similar words
print(np.linalg.norm(embeddings_index['man'] - embeddings_index['woman']))
print(np.linalg.norm(embeddings_index['man'] - embeddings_index['cat']))

# https://nlp.stanford.edu/projects/glove/
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['toad']))
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['man']))

# Similar spelling does not imply similar meaning
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['fog']))

print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['fork']))
print(np.linalg.norm(embeddings_index['frog'] - embeddings_index['skyscraper']))

3.364068
5.197995
4.1249743
6.7943544
7.3115597
6.5261197
7.450874
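
The numbers above are Euclidean distances between GloVe vectors, and a smaller distance tends to indicate more closely related words: man/woman (3.36) is closer than man/cat (5.20), and frog/toad (4.12) is closer than frog/skyscraper (7.45). The same comparison can be sketched without NumPy; the three-dimensional vectors below are invented for illustration and are not real GloVe values.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional vectors, made up for this example.
toy_vectors = {
    'frog': [1.0, 0.9, 0.0],
    'toad': [0.9, 1.0, 0.1],
    'skyscraper': [-1.0, 0.2, 2.0],
}

def nearest(word, vectors):
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return min(others, key=lambda w: euclidean(vectors[word], vectors[w]))

print(nearest('frog', toy_vectors))  # toad
```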


In [15]:
embedding_dim = 100 # We use 100 dimensional glove vectors

word_index = tokenizer.word_index
nb_words = min(vocab_size, len(word_index)) # How many words are there actually

embedding_matrix = np.zeros((nb_words, embedding_dim))

# The vectors need to be in the same position as their index,
# meaning a word with token 1 goes into the second row (rows start at zero), and so on.

# Loop over all words in the word index
for word, i in word_index.items():
    # Skip words whose index lies outside the vocabulary we keep
    if i >= vocab_size:
        continue
    # Get the embedding vector for the word
    embedding_vector = embeddings_index.get(word)
    # If there is an embedding vector, put it in the embedding matrix
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector
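
The loop above can be hard to picture on 20,000 rows, so here is the same alignment logic on toy data, with a counter for how many words were actually found. This is a sketch under stated assumptions: `build_matrix`, the toy word index, and the two-dimensional vectors are all hypothetical.

```python
def build_matrix(word_index, vectors, vocab_size, dim):
    """Row i holds the vector of the word with token i; missing words stay zero."""
    matrix = [[0.0] * dim for _ in range(vocab_size)]
    found = 0
    for word, i in word_index.items():
        if i >= vocab_size:       # outside the vocabulary we keep
            continue
        vector = vectors.get(word)
        if vector is not None:    # word has a pretrained vector
            matrix[i] = list(vector)
            found += 1
    return matrix, found

toy_word_index = {'the': 1, 'frog': 2, 'zzyzx': 3, 'rare': 4}
toy_vectors = {'the': [0.1, 0.2], 'frog': [0.3, 0.4], 'rare': [0.5, 0.6]}

toy_matrix, toy_found = build_matrix(toy_word_index, toy_vectors,
                                     vocab_size=4, dim=2)
print(toy_matrix)  # [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
print(toy_found)   # 2
```

Row 0 stays zero because index 0 is the padding token, and row 3 stays zero because 'zzyzx' has no pretrained vector. Counting the hits is a cheap way to see how much of the vocabulary the pretrained embeddings cover.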

In [16]:
print (embedding_matrix[100])

[-1.01400006  0.078819    0.47789001 -0.71000999  0.40336999 -0.013396
 -0.070241   -0.12796    -0.80293     0.58372998  0.27814999 -1.13110006
 -1.24269998  0.065034   -0.29615    -0.21926001 -0.11177     0.20290001
 -0.30069     0.45559001  0.98092002  0.32383999  0.11154     0.42175001
 -0.71700001  1.14900005 -0.26041001  0.59004998 -0.62717998  0.089107
  0.52561003  0.39030001 -0.10446     0.30394     0.58774     0.20553
  0.62854999  0.40913999  0.93089998  0.68953001 -0.058053   -1.03429997
 -0.1621     -0.59283    -0.46384001 -0.12187    -0.64608997 -0.099373
  0.32742    -0.45748001  0.11268    -0.71411997  0.54689002 -0.60856003
  0.16841    -1.85210001  0.34494001 -0.31538001  0.72078001  0.73034
  0.30781001 -0.40831     0.24587999 -0.396       0.82898003  0.43457001
  1.84220004  1.47230005  0.072308   -0.074464    0.29550001  0.3768
 -0.78801     0.20651001  0.74176002 -0.61141998 -0.10625    -0.46015999
 -0.81893998 -1.03250003 -1.11150002 -1.53659999  0.20482001 -1.038

In [17]:
model = Sequential()
model.add(Embedding(vocab_size, 
                    embedding_dim, 
                    input_length=max_length, 
                    weights = [embedding_matrix], 
                    trainable = False))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(128, 3, activation='relu'))
model.add(MaxPooling1D(3))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(20, activation='softmax'))
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          2000000   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 98, 128)           38528     
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 32, 128)           0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 30, 128)           49280     
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 10, 128)           0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 8, 128)            49280     
_________________________________________________________________
max_
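
The output shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model definition: `Conv1D` with stride 1 and no padding, and `MaxPooling1D` with a stride equal to its pool size. The helper names are made up for this example.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel_size + 1

def maxpool1d_length(length, pool_size):
    """Output length of max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Weights (kernel_size * in_channels per filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

length = 100  # max_length
lengths = []
for _ in range(3):  # three Conv1D + MaxPooling1D pairs
    length = conv1d_length(length, 3)
    lengths.append(length)
    length = maxpool1d_length(length, 3)
    lengths.append(length)

print(lengths)                       # [98, 32, 30, 10, 8, 2]
print(20000 * 100)                   # embedding parameters: 2000000
print(conv1d_params(100, 128, 3))    # first Conv1D: 38528
print(conv1d_params(128, 128, 3))    # later Conv1D layers: 49280
```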

In [18]:
# Note: categorical_crossentropy is the conventional loss for a softmax output
# over mutually exclusive classes. This notebook uses binary_crossentropy, so the
# metric is set to categorical_accuracy explicitly: with metrics=['accuracy'],
# Keras would report binary accuracy, which is inflated for one-hot labels.
# https://stackoverflow.com/questions/42081257/keras-binary-crossentropy-vs-categorical-crossentropy-performance
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=[categorical_accuracy])

model.fit(data, labels, validation_split=0.2, epochs=10)

Train on 15076 samples, validate on 3770 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fa4bc6083c8>
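
The reason for passing `categorical_accuracy` explicitly becomes clearer with a small example. The sketch below is plain Python, not the Keras implementation; `binary_accuracy` and `top1_accuracy` are hypothetical helpers and the prediction is invented. It scores one wrong prediction over 20 classes both ways:

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Share of individual outputs on the correct side of the threshold."""
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if (p > threshold) == (t == 1))
    return hits / len(y_true)

def top1_accuracy(y_true, y_pred):
    """1.0 if the most probable class is the true class, else 0.0."""
    return 1.0 if y_pred.index(max(y_pred)) == y_true.index(max(y_true)) else 0.0

num_classes = 20
y_true = [0] * num_classes
y_true[10] = 1      # true class is 10

y_pred = [0.0] * num_classes
y_pred[0] = 0.6     # the model prefers class 0, which is wrong
y_pred[10] = 0.4

print(binary_accuracy(y_true, y_pred))  # 18 of 20 outputs match
print(top1_accuracy(y_true, y_pred))    # 0.0
```

Even though the predicted class is wrong, 18 of the 20 outputs are correctly near zero, so the element-wise score looks high. That is why a per-class accuracy is the more honest number to report here.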

Our model achieves 63% accuracy on the validation set. Systems like this can be used to route emails in customer support centers, suggest responses, or classify other forms of text, such as invoices that need to be assigned to a department. Let's take a look at how our model classified one of the texts:

In [19]:
example = data[400] # get the tokens
print (texts[400])

# Print tokens as text (index 0 is padding and maps to no word)
for w in example:
    if w != 0:
        print(inv_index.get(w), end=' ')

From: JC924@uacsc2.albany.edu
Subject: Why are our desktop fonts changing?
Organization: University at Albany, Albany NY 12222
X-Newsreader: NNR/VM S_1.3.2
Lines: 17

One of our users is having an unusual problem.  If she does an Alt/Tab to
a full-screen DOS program, when she goes back to Windows her desktop fonts
have changed.  If she goes back to a full-screen DOS program and then goes
back to Windows, the font has changed back to its default font.  It's not
a major problem (everything works and the font is legible), but it is
annoying.  Does anyone have any idea why this happens.  By the way, she
has a DEC 486D2LP machine.
 
 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Jeffrey M. Cohen                      Voice: 518-442-3510
Office for Research (AD 218)          Fax:   518-442-3560
The University at Albany              E-mail: JC924@uacsc2.albany.edu
State University of New York
1400 Washington Ave.
Albany, NY 12222
++++++++++++++++++++++++++++++++++

In [20]:
# Get prediction
pred = model.predict(example.reshape(1, max_length))  # batch of one sequence

In [21]:
# Output predicted category
dataset.target_names[np.argmax(pred)]

'comp.os.ms-windows.misc'
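
Taking the `argmax` shows only the single best guess. For a routing system it can be useful to see the runner-up categories as well. A minimal sketch, where `top_k` is a hypothetical helper and the probabilities are invented; in the notebook it could be called as `top_k(pred[0], dataset.target_names)`:

```python
def top_k(probabilities, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    ranked = sorted(zip(probabilities, names), reverse=True)
    return [(name, prob) for prob, name in ranked[:k]]

# Hypothetical probabilities for three of the newsgroup categories.
toy_names = ['comp.graphics', 'comp.os.ms-windows.misc', 'rec.sport.hockey']
toy_probs = [0.25, 0.70, 0.05]

print(top_k(toy_probs, toy_names, k=2))
# [('comp.os.ms-windows.misc', 0.7), ('comp.graphics', 0.25)]
```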