<a href="https://colab.research.google.com/github/Dreaming-world/learn_tensorflow_nlp/blob/master/load_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import

In [2]:
!pip install tf-nightly
import tensorflow as tf

import tensorflow_datasets as tfds
import os
import numpy as np




## Download the data

In [3]:
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
  text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL+name)
  
parent_dir = os.path.dirname(text_dir)

parent_dir

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/cowper.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/derby.txt
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/illiad/butler.txt


'/root/.keras/datasets'

In [4]:
# Look at the first lines of each file
import os

for file_name in FILE_NAMES:
  file_path = os.path.join(parent_dir, file_name)
  show_num = 2
  with open(file_path, 'r') as f:
    print(file_path)
    while show_num > 0:
      show_num -= 1
      line = f.readline().strip()
      print(line)

/root/.keras/datasets/cowper.txt
﻿Achilles sing, O Goddess! Peleus' son;
His wrath pernicious, who ten thousand woes
/root/.keras/datasets/derby.txt
﻿Of Peleus' son, Achilles, sing, O Muse,
The vengeance, deep and deadly; whence to Greece
/root/.keras/datasets/butler.txt
﻿Sing, O goddess, the anger of Achilles son of Peleus, that brought
countless ills upon the Achaeans. Many a brave soul did it send


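## Load the data into TensorFlow datasets
Each returned element has the form:
  (sentence, file_id)
Step 1: read the text files with `tf.data.TextLineDataset`, then use `map` to pair each line (the feature) with its file index (the label).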

In [5]:
def labeler(example, index):
  return example, tf.cast(index, tf.int64)  

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
  lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))  # read the file line by line
  labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))  # pair each line (feature) with its file index (label); every element is a single line
  labeled_data_sets.append(labeled_dataset)

for ele in labeled_data_sets:
  for sample in ele.take(5):
    print(sample)


(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfAchilles sing, O Goddess! Peleus' son;">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'His wrath pernicious, who ten thousand woes'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Caused to Achaia's host, sent many a soul">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Illustrious into Ades premature,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'And Heroes gave (so stood the will of Jove)'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"\xef\xbb\xbfOf Peleus' son, Achilles, sing, O Muse,">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The vengeance, deep and deadly; whence to Greece'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=strin

## Concatenate the three texts


In [0]:
BUFFER_SIZE = 50000
BATCH_SIZE = 64
TAKE_SIZE = 5000

In [0]:
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
  all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# `shuffle` randomizes the order of the dataset's elements; its `buffer_size` argument sets the size of the buffer used for shuffling
all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)
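
The effect of the buffer size can be sketched in plain Python. This is a simplified simulation of a fixed-size shuffle buffer, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper written only for this illustration:

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order a fixed-size shuffle buffer could produce."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
        else:
            # Buffer is full: emit a random slot and refill it with the new item.
            slot = rng.randrange(buffer_size)
            yield buffer[slot]
            buffer[slot] = item
    # Input exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# A buffer of size 1 cannot reorder anything: the input order is preserved.
print(list(buffered_shuffle(range(10), buffer_size=1)))
# A buffer as large as the dataset allows any element to come first.
print(list(buffered_shuffle(range(10), buffer_size=10)))
```

A small buffer only mixes elements with their near neighbours, so for concatenated files the buffer should be large enough to mix lines from all three sources.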

Use `tf.data.Dataset.take` and `print` to inspect a few `(example, label)` pairs.

In [8]:
for ex in all_labeled_data.take(5):
  print(ex)

(<tf.Tensor: shape=(), dtype=string, numpy=b"And Priam's offspring, bring disastrous fate?">, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'On either side: but when the hour was come'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'Seeking the chamber of Laodice,'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'The weapons flew; on helm and bossy shield'>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'at the top of his voice, now from the acropolis, and now speeding up'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)


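## Vectorize the text

Machine learning models work on numbers, not words, so the string values need to be converted into lists of numbers. To do that, map each unique word to a unique integer.

### Build the vocabulary

### Vectorize the text using the vocabulary indices

Note: use the same tokenization method throughout.

First, build a vocabulary by tokenizing the text into a collection of individual words. There are several ways to do this in both TensorFlow and Python. This tutorial: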

1. Iterate over each example's `numpy` value.
2. Use a tokenizer to split it into tokens: `tfds.features.text.Tokenizer` for English, or `jieba` for Chinese (the cell below uses `jieba`).
3. Collect these tokens into a Python set, to remove duplicates.
4. Get the size of the vocabulary for later use.
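
The four steps above can be sketched on a couple of in-memory lines. This is a minimal illustration, assuming a regular-expression tokenizer as a stand-in for `tfds.features.text.Tokenizer` or `jieba`; the `lines` list and the `tokenize` helper are hypothetical and only mimic the byte strings the dataset yields:

```python
import re

# Hypothetical stand-in for the byte strings returned by `text_tensor.numpy()`.
lines = [
    b"Achilles sing, O Goddess! Peleus' son;",
    b"Of Peleus' son, Achilles, sing, O Muse,",
]

def tokenize(raw_bytes):
    """Decode a UTF-8 byte string and keep runs of word characters."""
    return re.findall(r"\w+", raw_bytes.decode("utf-8"))

vocabulary_set = set()
for raw in lines:                   # step 1: iterate over each example
    tokens = tokenize(raw)          # step 2: split it into tokens
    vocabulary_set.update(tokens)   # step 3: the set drops duplicates

vocab_size = len(vocabulary_set)    # step 4: vocabulary size for later use
print(sorted(vocabulary_set), vocab_size)
```

Which tokens end up in the vocabulary depends entirely on the tokenizer, which is why the same one has to be reused when the text is encoded.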

In [9]:
# Choose the tokenizer that fits your data
import jieba  # for Chinese text
tokenizer = tfds.features.text.Tokenizer()  # for English text
vocabulary_set = set()
for text_tensor, _ in all_labeled_data:  # sentence file_id
  # some_tokens = tokenizer.tokenize(text_tensor.numpy())
  some_tokens = jieba.lcut(text_tensor.numpy().strip())

  vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.871 seconds.
Prefix dict has been built successfully.


17186

In [12]:
print(list(vocabulary_set)[:5])

['leopardess', 'Wound', 'extraordinarily', 'murd', 'oxhide']


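### Encode the text

Step 1: build the token-to-index and index-to-token lookup tables.
Step 2: define your own text encoder.

Note: the encoder must use the same tokenization method that was used to build the vocabulary.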

Create an encoder by passing the `vocabulary_set` to `tfds.features.text.TokenTextEncoder`. The encoder's `encode` method takes in a string of text and returns a list of integers.
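
What such an encoder does can be sketched without any library. The code below is a hand-rolled stand-in, not `TokenTextEncoder` itself; the toy `vocabulary`, the `tokenize` helper, and the choice of index 0 for unknown tokens are assumptions made for the illustration:

```python
import re

def tokenize(text):
    """Stand-in tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical toy vocabulary; index 0 is reserved for unknown tokens.
vocabulary = ["Achilles", "sing", "O", "Goddess", ",", "!"]
token_to_index = {"UNK": 0}
for i, token in enumerate(vocabulary):
    token_to_index[token] = i + 1
index_to_token = {index: token for token, index in token_to_index.items()}

def encode(text):
    """Map each token to its index, falling back to 0 for unknown tokens."""
    return [token_to_index.get(token, 0) for token in tokenize(text)]

def decode(indices):
    """Map indices back to tokens."""
    return [index_to_token[i] for i in indices]

encoded = encode("Achilles sing, O Muse!")
print(encoded)          # [1, 2, 5, 3, 0, 6]
print(decode(encoded))  # ['Achilles', 'sing', ',', 'O', 'UNK', '!']
```

"Muse" is not in the toy vocabulary, so it encodes to 0 and decodes back as `UNK`; the original word cannot be recovered.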

In [0]:
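# Token -> index table. Index 0 is reserved for unknown tokens; it is also
# the value `padded_batch` pads with later on.
char2idx = {}
char2idx['UNK'] = 0
for i, u in enumerate(vocabulary_set):
  char2idx[u] = i + 1
# Index -> token table, the inverse of `char2idx`.
idx2char = {idx: token for token, idx in char2idx.items()}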

In [0]:
# Define a custom text encoder
class MyEncoder:
  def __init__(self, char2idx):
    self.char2idx = char2idx
  def encode(self, sentence):
    word_list = jieba.lcut(sentence.strip())
    result = []
    for ele in word_list:
      if ele in self.char2idx:
        result.append(self.char2idx[ele])
      else:
        result.append(self.char2idx["UNK"])
    return result

In [0]:
myencoder = MyEncoder(char2idx=char2idx)  # tokenizes with jieba
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)  # tokenizes with the tfds tokenizer

You can try this on a single line to see what the output looks like.

In [17]:
# Test the encoder on a single example
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)
# encoded_example = encoder.encode(example_text)
encoded_example = myencoder.encode(example_text)
print(encoded_example)

b"And Priam's offspring, bring disastrous fate?"
[312, 8169, 16676, 16221, 11986, 8169, 15664, 1353, 8169, 9185, 8169, 153, 8169, 13031, 17052]


Now run the encoder on the dataset by wrapping it in `tf.py_function` and passing that to the dataset's `map` method.

In [0]:
def encode(text_tensor, label):
  # encoded_text = encoder.encode(text_tensor.numpy())
  encoded_text = myencoder.encode(text_tensor.numpy())
  return encoded_text, label

You want to use `Dataset.map` to apply this function to each element of the dataset.  `Dataset.map` runs in graph mode.

* Graph tensors do not have a value. 
* In graph mode you can only use TensorFlow Ops and functions. 

So you can't `.map` this function directly: you need to wrap it in a `tf.py_function`. The `tf.py_function` passes regular tensors (with a value and a `.numpy()` method to access it) to the wrapped Python function.

In [0]:
def encode_map_fn(text, label):
  # py_func doesn't set the shape of the returned tensors.
  encoded_text, label = tf.py_function(
      encode, inp=[text, label], Tout=(tf.int64, tf.int64))

  # `tf.data.Datasets` work best if all components have a shape set
  #  so set the shapes manually: 
  encoded_text.set_shape([None])
  label.set_shape([])

  return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

## Split the dataset into test and training sets

Use `tf.data.Dataset.take` and `tf.data.Dataset.skip` to create a small test dataset and a larger training set.
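
The split logic can be sketched with plain Python iterators. This only mirrors the semantics of `take` and `skip`; the `examples` list and the two helpers are hypothetical:

```python
from itertools import islice

# Hypothetical dataset of ten already-shuffled examples.
examples = [f"line {i}" for i in range(10)]
TAKE_SIZE = 3

def take(iterable, n):
    """Keep the first n elements."""
    return list(islice(iterable, n))

def skip(iterable, n):
    """Drop the first n elements and keep the rest."""
    return list(islice(iterable, n, None))

test_split = take(examples, TAKE_SIZE)
train_split = skip(examples, TAKE_SIZE)
print(test_split)   # ['line 0', 'line 1', 'line 2']
print(len(train_split))  # 7
```

The two splits are disjoint only when both passes see the elements in the same order, which is the likely reason the earlier shuffle sets `reshuffle_each_iteration=False`.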

Before being passed into the model, the datasets need to be batched. Typically, the examples inside a batch need to be the same size and shape. But the examples in these datasets are not all the same size: each line of text has a different number of words. So use `tf.data.Dataset.padded_batch` (instead of `batch`) to pad the examples to the same size.
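
Padding each batch to its longest example can be sketched in plain Python. This is a simplified illustration of the idea, not the `tf.data` implementation; the `padded_batch` function below is a hypothetical helper, and the integer sequences are made up:

```python
def padded_batch(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its longest member."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        batches.append(
            [seq + [pad_value] * (longest - len(seq)) for seq in batch])
    return batches

# Hypothetical encoded lines of different lengths.
encoded = [[312, 8169, 16676], [153], [9185, 8169], [17052, 13031, 1353, 312]]
batches = padded_batch(encoded, batch_size=2)
print(batches[0])  # [[312, 8169, 16676], [153, 0, 0]]
print(batches[1])  # [[9185, 8169, 0, 0], [17052, 13031, 1353, 312]]
```

Each batch is padded independently, so different batches can have different widths.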

In [0]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE, padded_shapes=([None],[]))

Note: As of **TensorFlow 2.2** the `padded_shapes` argument is no longer required. The default behavior is to pad all axes to the longest in the batch.

In [0]:
train_data = all_encoded_data.skip(TAKE_SIZE).shuffle(BUFFER_SIZE)
train_data = train_data.padded_batch(BATCH_SIZE)

test_data = all_encoded_data.take(TAKE_SIZE)
test_data = test_data.padded_batch(BATCH_SIZE)

Now, `test_data` and `train_data` are not collections of `(example, label)` pairs, but collections of batches. Each batch is a pair of (*many examples*, *many labels*) represented as arrays.

To illustrate:

In [22]:
sample_text, sample_labels = next(iter(test_data))

sample_text[0], sample_labels[0]

(<tf.Tensor: shape=(30,), dtype=int64, numpy=
 array([  312,  8169, 16676, 16221, 11986,  8169, 15664,  1353,  8169,
         9185,  8169,   153,  8169, 13031, 17052,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)

Index 0 is used for padding (and is also where `char2idx` maps `UNK`), so the embedding layer needs one more entry than the vocabulary size: `vocab_size + 1`, which equals `len(char2idx)`.

## Build the model


In [0]:
model = tf.keras.Sequential()

The first layer converts integer representations to dense vector embeddings. See the [word embeddings tutorial](../text/word_embeddings.ipynb) for more details.

In [0]:
model.add(tf.keras.layers.Embedding(len(char2idx), 64, mask_zero=True))

The next layer is a [Long Short-Term Memory](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) layer, which lets the model understand words in their context with other words. A bidirectional wrapper on the LSTM helps it to learn about the datapoints in relationship to the datapoints that came before it and after it.

In [0]:
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))

Finally, we'll have a series of one or more densely connected layers, with the last one being the output layer. The output layer produces one score (logit) per label; it has no softmax activation, which is why the loss is later configured with `from_logits=True`. The label with the highest score is the model's prediction for an example.
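
The step from output scores to a predicted label can be sketched in plain Python. The `Dense(3)` output layer used here has no activation, so it emits raw scores (logits); a softmax turns them into probabilities without changing which label ranks highest. The logit values below are made up for the illustration:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output of the final layer for one example (three labels).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# The predicted label is the index of the largest probability.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```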

In [0]:
# One or more dense layers.
# Edit the list in the `for` line to experiment with layer sizes.
for units in [64, 64]:
  model.add(tf.keras.layers.Dense(units, activation='relu'))

# Output layer. The first argument is the number of labels.
model.add(tf.keras.layers.Dense(3))

In [27]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          1099968   
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               66048     
_________________________________________________________________
dense (Dense)                (None, 64)                8256      
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 195       
Total params: 1,178,627
Trainable params: 1,178,627
Non-trainable params: 0
_________________________________________________________________
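
The parameter counts in the summary can be reproduced by hand. The arithmetic below assumes the standard Keras parameterization (four gates per LSTM direction, each with input weights, recurrent weights, and a bias) and takes the vocabulary size of 17186 from the output earlier in this notebook:

```python
vocab_entries = 17186 + 1  # vocabulary size plus index 0
embed_dim = 64
lstm_units = 64

# Embedding: one 64-dimensional vector per vocabulary entry.
embedding = vocab_entries * embed_dim

# One LSTM direction: 4 gates x (input weights + recurrent weights + bias).
lstm_one_direction = 4 * (embed_dim + lstm_units + 1) * lstm_units
bidirectional = 2 * lstm_one_direction

# Dense layers: inputs x units weights, plus one bias per unit.
dense = (2 * lstm_units) * 64 + 64
dense_1 = 64 * 64 + 64
dense_2 = 64 * 3 + 3

total = embedding + bidirectional + dense + dense_1 + dense_2
print(embedding, bidirectional, dense, dense_1, dense_2, total)
# 1099968 66048 8256 4160 195 1178627
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the model size.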


Finally, compile the model. For a softmax categorization model, use `sparse_categorical_crossentropy` as the loss function. You can try other optimizers, but `adam` is very common.
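
What the loss computes for a single example can be sketched in plain Python. This is a minimal illustration of sparse categorical cross-entropy on raw logits, not the Keras implementation; the function name and the logit values are chosen for the example:

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy for one example, from raw logits and an integer label."""
    peak = max(logits)
    # log(sum(exp(logits))), computed stably by factoring out the maximum.
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of the true label under the softmax.
    return log_sum_exp - logits[label]

# Hypothetical logits for one example with three possible labels.
logits = [2.0, 0.5, -1.0]
loss_if_label_0 = sparse_categorical_crossentropy(logits, 0)
loss_if_label_2 = sparse_categorical_crossentropy(logits, 2)
print(loss_if_label_0, loss_if_label_2)
```

"Sparse" means the label is a plain integer such as 0, 1, or 2 rather than a one-hot vector, which matches the `file_id` labels built earlier. The loss is small when the true label has the highest logit and grows as the model favours a different label.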

In [0]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

## 训练模型

This model produces decent results on this data (about 83% accuracy; the run recorded here reaches 85.4% on the test set).

In [29]:
model.fit(train_data, epochs=3, validation_data=test_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fea471d9160>

In [30]:
eval_loss, eval_acc = model.evaluate(test_data)

print('\nEval loss: {:.3f}, Eval accuracy: {:.3f}'.format(eval_loss, eval_acc))


Eval loss: 0.351, Eval accuracy: 0.854


# Summary: how to read in text data