## Language modeling and generation -- a very basic example

This notebook provides a dirty and quick way to train a language model (LM) and generate text from it. It does not correspond to what's done in practice which requires a lot more manipulation to get the model state and pass it on from one cell to another: see, e.g., https://www.tensorflow.org/text/tutorials/text_generation?hl=en. For a first approach to language modeling and generation, I'd rather avoid this loop and go for a simpler option.

In this very basic version, we will thus train a LSTM as an n-gram model to encode a fixed number of tokens (say n, so we consider a n+1-gram model) and predict the next token from a fixed length history. Preparing training data is easy as all inputs are fixed length. So is prediction as long as we have a seed of n tokens to start with. We do so to illustrate n-grams, the notion of summarizing a sequence with a RNN state (the last one in this case), LSTM training and the most simple loop of text generation.


In [1]:
#
# load a bunch of modules
#

import json
import numpy as np
import random
from tqdm import tqdm
from nltk import word_tokenize
import tensorflow as tf

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.callbacks import EarlyStopping, ModelCheckpoint


2023-12-05 20:14:42.276547: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-05 20:14:42.276618: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-05 20:14:42.276657: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-05 20:14:42.286623: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
tf.config.list_physical_devices('GPU')

2023-12-05 20:14:44.944325: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-05 20:14:44.956327: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-05 20:14:44.956500: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

In [3]:
#
# load IMDb data and process a small number of samples (positive here)
#
fn = 'imdb-trn.json'

with open(fn, 'rt') as f:
    imdb_data = json.load(f)
    

imdb_data_small = imdb_data[:2000] + imdb_data[-2000:]


def clean_utterance(buf):
    '''
    Clean the list of tokens.
    '''
    ignore = ("``", "''", "(", ")", '<', 'br', '/', '>', '--', '*', '-')
    
    return [x.lower() for x in buf if x.lower() not in ignore]


#
# tokenize IMDb texts with NLTK tokenizer after lowercasing
#
pos_texts = [clean_utterance(word_tokenize(x[1])) for x in imdb_data if x[0] == 'pos']

for i in range(10):
    print('tokenized text[i] =', pos_texts[i])


tokenized text[i] = ['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', 'the', 'skipper', 'hale', 'jr.', 'as', 'a', 'police', 'sgt', '.']
tokenized text[i] = ['bizarre', 'horror', 'movie', 'filled', 'with', 'famous', 'faces', 'but', 'stolen', 'by', 'cristina', 'raines', 'later', 'of', 'tv', "'s", 'flamingo', 'road', 'as', 'a', 'pretty', 'but', 'somewhat', 'unstable', 'model', 'with', 'a', 'gummy', 'smile', 'who', 'is', 'slated', 'to', 'pay', 'for', 'her', 'attempted', 'suicides', 'by', 'guarding', 'the', 'gateway', 'to', 'hell', '!', 'the', 'scenes', 'with', 'raines', 'modeling', 'are', 'very', 'well', 'captured', ',', 'the', 'mood', 'music', 'is', '

### Prepare the vocabulary on which we will operate

This is the standard stuff that you'll rapidly get used to. Here we make a list of tokens in the dataset with the number of occurrences for each then (severely) limit the vocabulary to tokens appearing more than MINOCC times (defaults to 30). We add an <unk> token to replace tokens not in the selected vocabulary. We finally construct a word2id mapping (dictionary) and the inverse id2word mapping (list).
    
We provide in the next cell two useful functions to encode/decode a sequence: list of strings to list of integers for the former, and conversely for the latter.

In [4]:

def get_token_counts(idata, use_lemma = False):
    '''
    Create vocabulary from a bunch of (tokenized) texts. If use_lemma == True, use lemma rather than
    tokens. 
    
    Returns:
        - token count (dict)
    '''

    tokcnt = {}    
    
    for utterance in idata:
        for i, token in enumerate(utterance):
            tokcnt[token] = 1 if token not in tokcnt else tokcnt[token] + 1

    return dict(sorted(tokcnt.items(), key=lambda x: x[1], reverse = True))


count = get_token_counts(pos_texts)

#
# Pretty print a number of things
#
print('total number of tokens in dataset =', len(count))
print('most frequent tokens:')
for x in list(count.keys())[:20]:
    print(f"   {x:20}  {count[x]}")
print('\nleast frequent tokens:')
for x in list(count.keys())[-20:]:
    print(f"   {x:20}  {count[x]}")

#
# Select vocabulary: here, we will limit ourselves to tokens occurring at least MINOCC times and map the remaining
# ones to <unk>. So IDs of actual tokens will start at 1 and we reserve 0 for <unk>.
#

MINOCC = 30

word2id = {x: i+1 for i, x in enumerate([x for x in count if count[x] > MINOCC])}
word2id = {'<unk>': 0, **word2id} # this is a quick way to merge dictionaries in pyton 3.9 -- see https://stackoverflow.com/questions/38987/how-do-i-merge-two-dictionaries-in-a-single-expression-in-python

vocsize = len(word2id)

print('\ntotal number of tokens in vocab =', vocsize)
print(list(word2id.items())[:20])

# also reverse mapping for pretty printing
id2word = list(word2id.keys())
print(id2word[:20])


total number of tokens in dataset = 77041
most frequent tokens:
   the                   172318
   ,                     144077
   .                     117678
   and                   89398
   a                     83300
   of                    76630
   to                    66455
   is                    58467
   in                    49797
   it                    47350
   i                     40267
   that                  35526
   this                  34881
   's                    32132
   as                    26253
   with                  23197
   was                   22685
   for                   22303
   but                   20731
   film                  20284

least frequent tokens:
   vulgarities           1
   rÃªves                 1
   objectifier           1
   disaster.one          1
   ketty                 1
   konstadinou           1
   kavogianni            1
   'guilty               1
   laughing.every        1
   heart.my              1
   vassilis       

In [5]:
#
# Define utility function. To simplify the code, we will use global variables (which is not recommended)
# 

def encode_sequence(data):
    '''
    Return the encoded sequence given the tokens' strings and the word2id mapping. We assume <unk> is at index 0
    '''
    global word2id
    
    return [word2id.get(x, 0) for x in data]

def decode_sequence(data):
    '''
    Return the decoded sequence given the tokens' encodings and id2word mapping
    '''
    global id2word
    
    return [id2word[x] for x in data]
            
  
enc = encode_sequence(pos_texts[0])
print(pos_texts[0])
print(enc)
print(decode_sequence(enc)) 
# print(sequence_has_unk(enc))
# print(sequence_has_unk(enc[:10]))

['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', 'piscopo', 'is', 'actually', 'funny', '!', 'maureen', 'stapleton', 'is', 'a', 'scene', 'stealer', '.', 'the', 'moroni', 'character', 'is', 'an', 'absolute', 'scream', '.', 'watch', 'for', 'alan', 'the', 'skipper', 'hale', 'jr.', 'as', 'a', 'police', 'sgt', '.']
[18, 5, 21, 12, 231, 90, 1144, 46, 268, 26, 5, 166, 6, 648, 4172, 4427, 18, 13, 1031, 3, 918, 5, 21, 125, 772, 0, 8, 190, 182, 33, 5351, 0, 8, 5, 136, 0, 3, 1, 0, 114, 8, 36, 1744, 2434, 3, 117, 18, 1283, 1, 0, 0, 1837, 15, 5, 531, 6210, 3]
['for', 'a', 'movie', 'that', 'gets', 'no', 'respect', 'there', 'sure', 'are', 'a', 'lot', 'of', 'memorable', 'quotes', 'listed', 'for', 'this', 'gem', '.', 'imagine', 'a', 'movie', 'where', 'joe', '<unk>', 'is', 'actually', 'funny', '!', 'maureen', '<unk>', 'is', 'a', 'scene', '<unk>', '.', 'the',

### Train model

We firsty prepare training data for this simplified problem. Training data consists of fixed-length sequences with the corresponding label, i.e., the token that follows (all of these encoded as integers of course):
   ['for', 'a', 'movie', 'that', 'gets']  >>  no
   ['that', 'gets', 'no', 'respect', 'there']  >>  sure
   ['respect', 'there', 'sure', 'are', 'a']  >>  lot
   ['are', 'a', 'lot', 'of', 'memorable']  >>  quotes
   
As we will design a closed vocabulary LM (i.e., no possibility of assigning probability to the <unk> token), we discard sequences (history and label) where <unk> appears. As training data are documents rather than sentences, we also avoid sequences with end of sentence punctuation marks in the history (here, only the period is considered).
    
We then define the model's architecture, wich is simply a LSTM to encore the history followed by a dense projection. And infally train the model, which might take some time if you have no GPU.

In [6]:
#
# That's where we will simplify things very seriously to train a LSTM that can predict the following token. We will
# artificially build fixed length input sequences from the training data with the following word as the label to 
# predict. This is a rather dirty hack but highly facilitates life.
#

input_length = 5     # increase to account for longer histories
step = 3             # reduce to yield more training samples

X = []
Y = []

nignored = 0

for i, utterance in enumerate(pos_texts):
    # print('utterance', i, '=', utterance)
    
    for j in range(0, len(utterance) - input_length, step):
        buf = encode_sequence(utterance[j:j+input_length])
        label = word2id.get(utterance[j+input_length], 0) # assuming <unk> at index 0
        
        # let's make our life easier and stick to sequences and labels that are not <unk>,
        # also disregarding sequences with punctuation marks in the history (so we can stop)
        # generation whenever a question mark is selected. We limit ourselves to '.' but
        # should consider '!' and '?' also if things were to be done properly.
        
        if 0 not in buf and word2id['.'] not in buf and label != 0:
            X.append(buf)
            Y.append(label)
            # print('     j =', j, decode_sequence(buf), ' >>>', id2word[label], '     [ok]')
        else:
            nignored += 1
            # print('     j =', j, decode_sequence(buf), ' >>>', id2word[label], '     [ignored]')

print('number of sequences for training =', len(X))
print('number of sequences fdiscarded =', nignored)

#
# Finally, convert data to numpy for later use with tf.keras (not sure it is necessary)
#
X = np.array(X)
Y = np.array(Y)
for i in range(20):
    print('  ', decode_sequence(X[i]), ' >> ', id2word[Y[i]])

number of sequences for training = 549932
number of sequences fdiscarded = 527827
   ['for', 'a', 'movie', 'that', 'gets']  >>  no
   ['that', 'gets', 'no', 'respect', 'there']  >>  sure
   ['respect', 'there', 'sure', 'are', 'a']  >>  lot
   ['are', 'a', 'lot', 'of', 'memorable']  >>  quotes
   ['of', 'memorable', 'quotes', 'listed', 'for']  >>  this
   ['character', 'is', 'an', 'absolute', 'scream']  >>  .
   ['jr.', 'as', 'a', 'police', 'sgt']  >>  .
   ['bizarre', 'horror', 'movie', 'filled', 'with']  >>  famous
   ['filled', 'with', 'famous', 'faces', 'but']  >>  stolen
   ['to', 'hell', '!', 'the', 'scenes']  >>  with
   ['very', 'well', 'captured', ',', 'the']  >>  mood
   [',', 'the', 'mood', 'music', 'is']  >>  perfect
   [',', 'but', 'when', 'raines', 'moves']  >>  into
   ['raines', 'moves', 'into', 'a', 'creepy']  >>  brooklyn
   ['by', 'a', 'blind', 'priest', 'on']  >>  the
   ['priest', 'on', 'the', 'top', 'floor']  >>  ,
   ['top', 'floor', ',', 'things', 'really']  >>  

In [7]:
#
# Define the model and the hyperparameters such as embedding dimension and LSTM state dimension.
#


embedding_size = 100
lstm_size = 100

model = Sequential()
model.add(Embedding(vocsize, embedding_size, input_length = input_length))
model.add(LSTM(lstm_size))
model.add(Dropout(0.1))
model.add(Dense(vocsize, activation = 'softmax'))

model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam')
print(model.summary())



2023-12-05 20:15:00.999051: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-05 20:15:00.999290: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-12-05 20:15:00.999437: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 5, 100)            650500    
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dropout (Dropout)           (None, 100)               0         
                                                                 
 dense (Dense)               (None, 6505)              657005    
                                                                 
Total params: 1387905 (5.29 MB)
Trainable params: 1387905 (5.29 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


In [8]:
#
# And finally run training with an early stopping criterion
#


epochs = 40
batch_size = 128
val_split = 0.2

stop = EarlyStopping(monitor = 'val_loss', min_delta = 0, patience = 5, verbose = 1, mode = 'auto')
save = ModelCheckpoint('data.NOSAVE/lstm-10-5.x', monitor = 'val_loss', verbose = 0, save_best_only = True)

history = model.fit(X, Y, batch_size = batch_size, epochs = epochs, verbose = 1, validation_split = val_split, callbacks = [stop, save])
    
model.save('data.NOSAVE/lstm-10-5-final.x')

Epoch 1/40


2023-12-05 20:15:25.383106: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8700
2023-12-05 20:15:25.955475: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb008b346c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-12-05 20:15:25.955494: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA GeForce RTX 3060 Laptop GPU, Compute Capability 8.6
2023-12-05 20:15:25.959364: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-12-05 20:15:26.021020: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.




INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 2/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 3/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 4/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 5/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 6/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 7/40


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5.x/assets


Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 12: early stopping
INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5-final.x/assets


INFO:tensorflow:Assets written to: data.NOSAVE/lstm-10-5-final.x/assets


### Playing with the model

Now we have a properly trained model able to take in a few tokens and output a probability distribution function over the vocabulary for the next position (i.e., p[.|h]). And we can do a few things with this:
  - visualize the distribution p[.|h] for a given history
  - see in actual utterances how prediction differs from reality (also looking at probabilities)
  - compute perplexity on a smal dataset (could be done with + and - samples)
  - generate reviews from a prompt
  
We will not do every of these (you can do that for yourself) but only go through the second and fourth points.

We will first prepare unseen data in the same way as above and see how prediction

In [9]:
#
# Get a few unseen texts from the database and process them
#
[clean_utterance(word_tokenize(x[1])) for x in imdb_data if x[0] == 'pos']

pos_tests = [clean_utterance(word_tokenize(x[1])) for x in imdb_data[2000:2050]]
neg_tests = [clean_utterance(word_tokenize(x[1])) for x in imdb_data[-50:]]

X1 = []
y1 = []

for utterance in pos_tests:
    for j in range(0, len(utterance) - input_length, step):
        buf = encode_sequence(utterance[j:j+input_length])
        label = word2id.get(utterance[j+input_length], 0) # assuming <unk> at index 0
        if 0 not in buf and word2id['.'] not in buf and label != 0:
            X1.append(buf)
            y1.append(label)

print('number of positive sequences for testing =', len(X1))

X2 = []
y2 = []

for utterance in neg_tests:
    for j in range(0, len(utterance) - input_length, step):
        buf = encode_sequence(utterance[j:j+input_length])
        label = word2id.get(utterance[j+input_length], 0) # assuming <unk> at index 0
        if 0 not in buf and word2id['.'] not in buf and label != 0:
            X2.append(buf)
            y2.append(label)

print('number of negative sequences for testing =', len(X2))

number of positive sequences for testing = 2142
number of negative sequences for testing = 2405


In [10]:

def predict(model, h, mode = 'best', true_i = None):
    '''
    Return a predicted token given the history and the model. Said more simply, predict p[.|h]
    with the model and take the best guess or a random guess (depending on mode).
    
    Note the np.newaxis trick to make believe we have a batch (of size 1) as the model expects
    batches.
    
    Returns predicted token with the corresponding probability, optionnally returning the activation 
    prob of the true token if true_i is provided
    '''
    
    # pred = model.predict(h[np.newaxis, :], verbose = 0)
    pred = model.predict([h], verbose = 0)
    
    if mode == 'best':
        pred_i = np.argmax(pred)
    else:
        pred_i = random.choice(len(pred), pred)
    
    return pred_i, pred[0][pred_i], 0 if true_i == None else pred[0][true_i]


average_prob = 0
i_print = 0

for history, label in zip(X1,y1):
    best, pred_prob, true_prob = predict(model, history, true_i = label)
    average_prob += true_prob

    if i_print < 100:
        i_print += 1
        h_str = str(decode_sequence(history))
        print('{:50}   best = {} ({:.3f})   true = {} ({:.3f})'.format(h_str, id2word[best], pred_prob, id2word[label], true_prob))

print('\nAverage probability for next token = {:.3f}'.format(average_prob / len(X1)))

#
# TODO: check if average probability is in the same order of magnitude for negative examples
# or if we have a similar behavior.
#

['a', 'small', 'florida', 'beach', 'town']           best = , (0.290)   true = in (0.138)
['beach', 'town', 'in', 'the', 'dead']               best = , (0.194)   true = of (0.009)
['the', 'dead', 'of', 'winter', 'i']                 best = have (0.187)   true = 've (0.060)
['winter', 'i', "'ve", 'been', 'there']              best = . (0.162)   true = , (0.153)
['been', 'there', ',', 'and', 'this']                best = is (0.424)   true = is (0.424)
["'s", 'also', 'the', 'debut', 'feature']            best = of (0.466)   true = of (0.466)
['debut', 'feature', 'of', 'actress', 'ashley']      best = judd (0.160)   true = judd (0.160)
['actress', 'ashley', 'judd', ',', 'and']            best = a (0.115)   true = she (0.019)
[',', 'and', 'she', 'makes', 'a']                    best = lot (0.072)   true = big (0.009)
['makes', 'a', 'big', 'impression', 'here']          best = . (0.314)   true = . (0.314)
['it', "'s", 'hard', 'to', 'believe']                best = that (0.387)   true = this 

In [24]:
#
# Basic text generation routine starting from a prompt.
# 

MAX_SENTENCE_SIZE = 100

#
# TODO: This is now your turn to work and write a small loop that generates tokens
# following the prompt.  As we discarded all input training sequences with a period,
# you should stop generating text whenever we select the period token or when the 
# limit of MAX_SENTENCE_SIZE is reached.
# 

def generate(prompt, model, stop):
    text = prompt

    for i in range(MAX_SENTENCE_SIZE):
        print(text[-5:])
        output = predict(model, text[-5:])
        print(output)
        token = output[0]
        text.append(token)
        if token == stop:
            break
        
    return text

prompt = encode_sequence(['it', "'s", 'hard', 'to', 'believe'])
# text = encode_sequence(['that', 'the', 'film', 'is', 'not'])
# text = encode_sequence(['this', 'is', 'one', 'of', 'the'])
# text = encode_sequence(['take', 'the', 'time', 'to', 'watch'])

sample = generate(prompt, model, [word2id['.']])
print(' '.join(decode_sequence(sample)))


[10, 14, 283, 7, 303]
(12, 0.3870387, 0)
[14, 283, 7, 303, 12]
(1, 0.09925266, 0)
[283, 7, 303, 12, 1]
(20, 0.058296826, 0)
[7, 303, 12, 1, 20]
(8, 0.4920028, 0)
[303, 12, 1, 20, 8]
(5, 0.07880144, 0)
[12, 1, 20, 8, 5]


ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {'(<class \'list\'> containing values of types {"<class \'numpy.int64\'>"})'}), <class 'NoneType'>

: 

In [17]:
[word2id['.']]

[3]