# Setup

## Google Colab Only

In [21]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
import os
import sys
sys.path.append(os.path.dirname(os.path.abspath('')))
sys.path.append('/content/drive/My Drive/Code/autocomplete_me')

In [23]:
!ls -ls '/content/drive/My Drive/Code/autocomplete_me'

total 17
4 drwx------ 3 root root 4096 Jul 11 13:00 data
4 drwx------ 2 root root 4096 Jul 11 13:26 models
4 drwx------ 2 root root 4096 Jul 11 13:01 notebooks
1 -rw------- 1 root root   57 Jul 11 08:44 requirements.txt
4 drwx------ 3 root root 4096 Jul 11 13:00 src


In [24]:
from src import utils, reader, predict_utils

In [5]:
from importlib import reload
reload(utils)
reload(reader)
reload(predict_utils)

<module 'src.predict_utils' from '/content/drive/My Drive/Code/autocomplete_me/src/predict_utils.py'>

# Build Model

In [25]:
text = reader.read_bbc_tech()
content_type = 'BBC-tech'

In [10]:
text[0]

'US duo in first spam conviction\n\nA brother and sister in the US have been convicted of sending hundreds of thousands of unsolicited e-mail messages to AOL subscribers.\n\nIt is the first criminal prosecution of internet spam distributors. Jurors in Virginia recommended that the man, Jeremy Jaynes, serve nine years in prison and that his sister, Jessica DeGroot, be fined $7,500. They were convicted under a state law that bars the sending of bulk e-mails using fake addresses.\n\nThey will be formally sentenced next year. A third defendant, Richard Rutkowski, was acquitted. Prosecutors said Jaynes was "a snake oil salesman in a new format", using the internet to peddle useless wares, news agency Associated Press reported. A "Fed-Ex refund processor" was supposed to allow people to earn $75 an hour working from home. Another item on sale was an "internet history eraser". His sister helped him process credit card payments. Jaynes amassed a fortune of $24m from his sales, prosecutors said

## Training

### Long Text

In [11]:
training_dict, word_idx, idx_word, sequences, num_words = utils.get_data(text)

There are 15491 unique words.
There are 199462 sequences.
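
The sequence count above is consistent with a sliding-window scheme. The following is a minimal sketch, assuming (the source does not confirm this) that `utils.get_data` slides a window of `training_len` words over each tokenized article and uses the next word as the label. `make_sequences` and `split_train_valid` are hypothetical names, and the 70/30 split is inferred from the sample counts printed by the training cells.

```python
def make_sequences(token_ids, training_len=50):
    """Return (features, labels) from a sliding window over one document."""
    features, labels = [], []
    for start in range(len(token_ids) - training_len):
        window = token_ids[start:start + training_len + 1]
        features.append(window[:-1])  # first training_len tokens
        labels.append(window[-1])     # the token that follows them
    return features, labels


def split_train_valid(features, labels, train_fraction=0.7):
    """Split in order; train_fraction=0.7 is an assumption, not a documented value."""
    cut = int(train_fraction * len(features))
    return (features[:cut], labels[:cut]), (features[cut:], labels[cut:])


doc = list(range(1, 61))  # toy document of 60 token ids
X, y = make_sequences(doc, training_len=50)
(train_X, train_y), (valid_X, valid_y) = split_train_valid(X, y)
print(len(X), len(train_X), len(valid_X))

# The same fraction reproduces the training-set size reported in the logs.
print(int(0.7 * 199462))  # 139623
```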


In [12]:
# embedding_matrix = utils.create_embedding_matrix(word_idx, num_words, '/Users/jaipancholi/data/glove.6B.100d.txt')
embedding_matrix = utils.create_embedding_matrix(word_idx, num_words, '/content/drive/My Drive/Code/autocomplete_me/data/glove.6B.100d.txt')
embedding_matrix

Glove Vectors loading with dimension 100
There were 5692 words without pre-trained embeddings.


  embedding_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1).reshape((-1, 1))


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.00656124, -0.04206555,  0.12508174, ..., -0.02506376,
         0.14220549,  0.04648907],
       [-0.06223089,  0.03835242,  0.08488411, ..., -0.04284497,
         0.08662399, -0.00527513],
       ...,
       [-0.00346489, -0.08874972, -0.20891733, ...,  0.09497839,
        -0.12697276, -0.08340222],
       [-0.02349926, -0.06641477, -0.01852101, ...,  0.07821092,
         0.12694902,  0.1010958 ],
       [ 0.02721107,  0.00886596,  0.24643607, ...,  0.02826212,
         0.23311737,  0.06475616]])
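
The stray `np.linalg.norm` line in the output above appears to be the source line of a NumPy warning, which typically shows up when some rows have zero norm (presumably the padding row and the words without pre-trained embeddings). Here is a minimal sketch of row normalization that avoids the division by zero. `normalize_rows` is a hypothetical helper and is not the code in `utils.create_embedding_matrix`.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row to unit L2 norm, leaving all-zero rows as zeros."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    safe_norms = np.where(norms == 0, 1.0, norms)  # avoid 0 / 0 on empty rows
    return matrix / safe_norms


toy = np.array([[0.0, 0.0],   # word without a pre-trained vector
                [3.0, 4.0]])  # word with a vector of norm 5
print(normalize_rows(toy))
```

With this guard the first row stays at zero, matching the all-zero first row printed above.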

#### Unidirectional, 1 layer, Untrainable

In [13]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=False,
    lstm_layers=1,
    bi_direc=False
)

In [14]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 100)         1549100   
_________________________________________________________________
masking_1 (Masking)          (None, None, 100)         0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 64)                42240     
_________________________________________________________________
dense_1 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 15491)             1998339   
Total params: 3,597,999
Trainable params: 2,048,899
Non-trainable params: 1,549,100
____________________________________
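
The parameter counts in this summary can be reproduced by hand. This short sketch uses the standard formulas for embedding, LSTM and dense layers; the function names are illustrative only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


vocab_size, embed_dim = 15491, 100
counts = {
    "embedding": embedding_params(vocab_size, embed_dim),  # 1549100
    "lstm": lstm_params(embed_dim, 64),                    # 42240
    "dense_hidden": dense_params(64, 128),                 # 8320
    "dense_output": dense_params(128, vocab_size),         # 1998339
}
total = sum(counts.values())             # 3597999
trainable = total - counts["embedding"]  # 2048899 with frozen embeddings
print(counts, total, trainable)
```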

In [91]:
history = utils.train_model(
    training_dict,
    f'{content_type}_uni-1_layer-untrainable-50_seq',
    model=model,
    epochs=50
)

Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


#### Unidirectional, 2 layers, Untrainable

In [15]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=False,
    lstm_layers=2,
    bi_direc=False
)

In [16]:
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, None, 100)         1549100   
_________________________________________________________________
masking_2 (Masking)          (None, None, 100)         0         
_________________________________________________________________
lstm_2 (LSTM)                (None, None, 64)          42240     
_________________________________________________________________
lstm_3 (LSTM)                (None, 64)                33024     
_________________________________________________________________
dense_3 (Dense)              (None, 128)               8320      
_________________________________________________________________
dropout_2 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 15491)            

In [93]:
history = utils.train_model(
    training_dict,
    f'{content_type}_uni-2_layers-untrainable-50_seq',
    model=model,
    epochs=50
)

Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


#### Bidirectional, 1 layer, Untrainable

In [17]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=False,
    lstm_layers=1,
    bi_direc=True
)

In [18]:
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, None, 100)         1549100   
_________________________________________________________________
masking_3 (Masking)          (None, None, 100)         0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 128)               84480     
_________________________________________________________________
dense_5 (Dense)              (None, 128)               16512     
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 15491)             1998339   
Total params: 3,648,431
Trainable params: 2,099,331
Non-trainable params: 1,549,100
____________________________________
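
The bidirectional layer's count follows from running one LSTM forward and one backward and concatenating their outputs, which is why the output width is 128 and the next dense layer grows as well. This is a small arithmetic sketch with illustrative function names.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    return input_dim * units + units


bidirectional = 2 * lstm_params(100, 64)  # forward + backward LSTM: 84480
dense_hidden = dense_params(2 * 64, 128)  # input is the 128-wide concatenation: 16512
total_bi = 15491 * 100 + bidirectional + dense_hidden + dense_params(128, 15491)
print(bidirectional, dense_hidden, total_bi)  # 84480 16512 3648431
```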

In [95]:
history = utils.train_model(
    training_dict,
    f'{content_type}_bi-1_layer-untrainable-50_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


#### Bidirectional, 2 layers, Untrainable

In [96]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=False,
    lstm_layers=2,
    bi_direc=True
)

In [97]:
history = utils.train_model(
    training_dict,
    f'{content_type}_bi-2_layer-untrainable-50_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


#### Unidirectional, 1 layer, Trainable

In [98]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=True,
    lstm_layers=1,
    bi_direc=False
)

In [99]:
train_history = utils.train_model(
    training_dict,
    f'{content_type}_uni-1_layer-trainable-50_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


In [None]:
train_history

#### Bidirectional, 1 layer, Trainable

In [100]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=True,
    lstm_layers=1,
    bi_direc=True
)

In [101]:
history = utils.train_model(
    training_dict,
    f'{content_type}_bi-1_layer-trainable-50_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 139623 samples, validate on 59839 samples
Epoch 1/50
...
Epoch 50/50


### Short Text

In [9]:
training_dict, word_idx, idx_word, sequences, num_words = utils.get_data(text, training_len=20)

There are 15491 unique words.
There are 211492 sequences.
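
Shortening the window from 50 to 20 words yields more sequences than the long-text run. This quick check assumes one window per valid start position in each article, which is an assumption about `utils.get_data` and is not confirmed by the source. Under that assumption every article long enough for both window sizes contributes 30 extra sequences.

```python
seqs_len_50 = 199462  # reported for the default (50-word) window
seqs_len_20 = 211492  # reported for training_len=20

extra = seqs_len_20 - seqs_len_50  # 12030 additional sequences
per_article = 50 - 20              # extra windows per article under the assumption
articles, remainder = divmod(extra, per_article)
print(extra, articles, remainder)  # 12030 401 0
```

The difference divides evenly, which fits the assumption and would correspond to 401 articles in the corpus.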


In [10]:
# embedding_matrix = utils.create_embedding_matrix(word_idx, num_words, '/Users/jaipancholi/data/glove.6B.100d.txt')
embedding_matrix = utils.create_embedding_matrix(word_idx, num_words, '/content/drive/My Drive/Code/autocomplete_me/data/glove.6B.100d.txt')
embedding_matrix

Glove Vectors loading with dimension 100
There were 5692 words without pre-trained embeddings.


  embedding_matrix = embedding_matrix / np.linalg.norm(embedding_matrix, axis=1).reshape((-1, 1))


array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.00656124, -0.04206555,  0.12508174, ..., -0.02506376,
         0.14220549,  0.04648907],
       [-0.06223089,  0.03835242,  0.08488411, ..., -0.04284497,
         0.08662399, -0.00527513],
       ...,
       [-0.00346489, -0.08874972, -0.20891733, ...,  0.09497839,
        -0.12697276, -0.08340222],
       [-0.02349926, -0.06641477, -0.01852101, ...,  0.07821092,
         0.12694902,  0.1010958 ],
       [ 0.02721107,  0.00886596,  0.24643607, ...,  0.02826212,
         0.23311737,  0.06475616]])

#### Bidirectional, 2 layers, Trainable

In [11]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=True,
    lstm_layers=2,
    bi_direc=True
)

In [12]:
history = utils.train_model(
    training_dict,
    f'{content_type}_bi-2_layer-trainable-20_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 148044 samples, validate on 63448 samples
Epoch 1/50
...
Epoch 50/50


#### Unidirectional, 1 layer, Trainable

In [13]:
model = utils.make_word_level_model(
    num_words,
    embedding_matrix,
    lstm_cells=64,
    trainable=True,
    lstm_layers=1,
    bi_direc=False
)

In [15]:
history = utils.train_model(
    training_dict,
    f'{content_type}_uni-1_layer-trainable-20_seq',
    model=model,
    # use_pretrained_model=True,
    epochs=50
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 148044 samples, validate on 63448 samples
Epoch 1/50
...
Epoch 50/50


## Compare Models

<table>
<tr>
    <th>Name of Model</th>
    <th>LSTM Layers</th>
    <th>Bidirectional</th>
    <th>Trainable Embeddings</th>
    <th>Sequence Length</th>
    <th>Val Loss</th>
    <th>Val Accuracy</th>
</tr>
<tr>
    <td>BBC-tech_uni-1_layer-untrainable-50_seq</td>
    <td>1</td>
    <td>False</td>
    <td>False</td>
    <td>50</td>
    <td>6.4511</td>
    <td>0.1130</td>
</tr>
<tr>
    <td>BBC-tech_uni-2_layers-untrainable-50_seq</td>
    <td>2</td>
    <td>False</td>
    <td>False</td>
    <td>50</td>
    <td>6.5666</td>
    <td>0.1039</td>
</tr>
<tr>
    <td>BBC-tech_bi-1_layer-untrainable-50_seq</td>
    <td>1</td>
    <td>True</td>
    <td>False</td>
    <td>50</td>
    <td>6.3906</td>
    <td>0.1154</td>
</tr>
<tr>
    <td>BBC-tech_bi-2_layer-untrainable-50_seq</td>
    <td>2</td>
    <td>True</td>
    <td>False</td>
    <td>50</td>
    <td>6.5250</td>
    <td>0.1065</td>
</tr>
<tr>
    <td>BBC-tech_uni-1_layer-trainable-50_seq</td>
    <td>1</td>
    <td>False</td>
    <td>True</td>
    <td>50</td>
    <td>6.6767</td>
    <td>0.1770</td>
</tr>
<tr>
    <td>BBC-tech_bi-1_layer-trainable-50_seq</td>
    <td>1</td>
    <td>True</td>
    <td>True</td>
    <td>50</td>
    <td>6.9115</td>
    <td>0.1669</td>
</tr>
<tr>
    <td>BBC-tech_bi-2_layer-trainable-20_seq</td>
    <td>2</td>
    <td>True</td>
    <td>True</td>
    <td>20</td>
    <td>6.8365</td>
    <td>0.1451</td>
</tr>
<tr>
    <td>BBC-tech_uni-1_layer-trainable-20_seq</td>
    <td>1</td>
    <td>False</td>
    <td>True</td>
    <td>20</td>
    <td>6.6482</td>
    <td>0.1757</td>
</tr>
</table>
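
The validation figures in the table can be ranked programmatically. The sketch below copies the table values into a list; the `Run` tuple and its field names are illustrative only.

```python
from collections import namedtuple

Run = namedtuple("Run", "lstm_layers bidirectional trainable seq_len val_loss val_acc")

runs = [
    Run(1, False, False, 50, 6.4511, 0.1130),
    Run(2, False, False, 50, 6.5666, 0.1039),
    Run(1, True, False, 50, 6.3906, 0.1154),
    Run(2, True, False, 50, 6.5250, 0.1065),
    Run(1, False, True, 50, 6.6767, 0.1770),
    Run(1, True, True, 50, 6.9115, 0.1669),
    Run(2, True, True, 20, 6.8365, 0.1451),
    Run(1, False, True, 20, 6.6482, 0.1757),
]

by_accuracy = sorted(runs, key=lambda r: r.val_acc, reverse=True)
lowest_loss = min(runs, key=lambda r: r.val_loss)

for run in by_accuracy:
    print(run)
print("lowest validation loss:", lowest_loss)
```

The run with the highest validation accuracy (0.1770, trainable embeddings) is not the run with the lowest validation loss (6.3906, frozen embeddings), so the choice of model depends on which metric matters more.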

# Sample Output

## Load Model

In [26]:
from keras.models import load_model
# Get Model Weights and Architecture
# MODELS_DIR = os.path.join(os.path.dirname(os.path.abspath('')), 'models')
MODELS_DIR = '/content/drive/My Drive/Code/autocomplete_me/models'
model_filepath = os.path.join(MODELS_DIR, f'{content_type}_uni-1_layer-trainable-20_seq.h5')

model = load_model(model_filepath)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


In [27]:
# Get Text Data
# text = reader.read_bbc_tech()

# Seed length used below; note that the model loaded above was trained on 20-word sequences
TRAINING_LENGTH = 50
training_dict, word_idx, idx_word, sequences, num_words = utils.get_data(text, training_len=TRAINING_LENGTH)

There are 15491 unique words.
There are 199462 sequences.


In [28]:
original_sequence, gen_list, a = predict_utils.generate_output(
    model,
    sequences,
    idx_word,
    seed_length=TRAINING_LENGTH,
    new_words=50,
    diversity=1,
    n_gen=1
)
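
The `diversity` argument is not defined in this notebook. A common convention in text-generation examples is to treat it as a softmax temperature applied to the predicted word probabilities before sampling. The sketch below works under that assumption; `rescale` and `sample_index` are hypothetical helpers and are not part of `predict_utils`.

```python
import math
import random


def rescale(probs, diversity=1.0):
    """Re-weight probabilities by a temperature; values below 1 sharpen the peak."""
    logits = [math.log(max(p, 1e-12)) / diversity for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, diversity=1.0, rng=None):
    """Draw one index according to the rescaled distribution."""
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=rescale(probs, diversity), k=1)[0]


probs = [0.7, 0.2, 0.1]
print(rescale(probs, diversity=1.0))  # unchanged up to rounding
print(rescale(probs, diversity=0.5))  # more weight on the most likely word
print(sample_index(probs, diversity=1.0, rng=random.Random(0)))
```

With `diversity=1` the distribution is sampled as predicted, which is the setting used in the cell above.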

In [29]:
' '.join(original_sequence)

"own name . Part of the surge in people signing up was due to BT stretching the reach of ADSL - the UK's most widely used way of getting broadband - beyond six kilometres . Asymmetric Digital Subscriber Line technology lets ordinary copper phone lines support high data speeds ."

In [30]:
' '.join(gen_list[0])

'< --- > internet Bysshe are left recently said , says . All unsolicited attacks offered in data and very . The move is open on universal successfully growing announced for this cell than a premium cards . We means only different software , which are then pay in music . Blogs ,'

In [33]:
' '.join(a)

'< --- > The standard speed is 512kbps , though faster connections are available . According to BT , more than 95 of UK homes and businesses can receive broadband over the phone line . It aims to extend this figure to 99.4 by next summer . There are also an estimated 1.7'

## Custom Sentence

In [34]:
sentence = 'Stocks of major large technology firms are becoming'
predict_utils.generate_custom_sentence(sentence, word_idx, idx_word, model, new_words=50)

[None, 5, 524, 490, 47, 130, 16, 503]
conference . The Gracenote service can the firm Centre charge of others , being 92.9 of dedicated online trialled changing file-swapping , says months . He has also seen more granted so face the part of early approach and the Entropia Athlon accounts director of reconditioned graphics part of the
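
The list printed before the generated text looks like the seed sentence mapped to vocabulary indices, with `None` for a token that is missing from `word_idx`. The sketch below assumes whitespace tokenization and a plain dictionary lookup. `encode_sentence` is a hypothetical helper, and the toy vocabulary simply pairs the visible indices with the words in order, which is itself an assumption.

```python
def encode_sentence(sentence, word_idx):
    """Map each whitespace-separated token to its index, or None if unknown."""
    return [word_idx.get(token) for token in sentence.split()]


# Toy vocabulary reconstructed from the printed output (assumed pairing).
toy_word_idx = {
    "of": 5, "major": 524, "large": 490, "technology": 47,
    "firms": 130, "are": 16, "becoming": 503,
}

encoded = encode_sentence("Stocks of major large technology firms are becoming", toy_word_idx)
print(encoded)  # [None, 5, 524, 490, 47, 130, 16, 503]

# Unknown tokens could be dropped before prediction.
known = [index for index in encoded if index is not None]
print(known)
```

A case-sensitive lookup like this would also explain why a capitalized or punctuated first word such as `However,` maps to `None`.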


In [37]:
sentence = 'However, there have been many instances of'
predict_utils.generate_custom_sentence(sentence, word_idx, idx_word, model, new_words=50)

[None, 96, 21, 45, 81, 8389, 5]
the huge success . Santy starts 38mph Love , megapixel media could be committed to adhere to learn the services to have quickly . It ever looked , right to be like online with recent difference and wi-fi card on goods directly on their customers and killed inside home different
