diff --git a/.travis.yml b/.travis.yml
index f069c12016..a5559a7963 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -21,9 +21,8 @@ before_install:
   - docker pull paddlepaddle/paddle:latest
 script:
   - .travis/precommit.sh
-  - docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
-    "cd /py_unittest && find . -name 'tests' -type d -print0 | xargs -0 -I{} -n1 bash -c 'cd {};
-    python -m unittest discover -v'"
+  - docker run -i --rm -v "$PWD:/py_unittest" paddlepaddle/paddle:latest /bin/bash -c
+    'cd /py_unittest; sh .travis/unittest.sh'
 notifications:
   email:
diff --git a/.travis/unittest.sh b/.travis/unittest.sh
new file mode 100755
index 0000000000..23f15094c8
--- /dev/null
+++ b/.travis/unittest.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+
+abort(){
+    echo "Running unit tests failed" 1>&2
+    echo "Please check your code" 1>&2
+    exit 1
+}
+
+unittest(){
+    cd "$1" > /dev/null
+    if [ -f "requirements.txt" ]; then
+        pip install -r requirements.txt
+    fi
+    if [ $? != 0 ]; then
+        exit 1
+    fi
+    find . -name 'tests' -type d -print0 | \
+        xargs -0 -I{} -n1 bash -c \
+        'python -m unittest discover -v -s {}'
+    cd - > /dev/null
+}
+
+trap 'abort' 0
+set -e
+
+for proj in */ ; do
+    if [ -d "$proj" ]; then
+        unittest "$proj"
+        if [ $? != 0 ]; then
+            exit 1
+        fi
+    fi
+done
+
+trap : 0
diff --git a/README.md b/README.md
index 8fd9edfec3..fb0e20bf42 100644
--- a/README.md
+++ b/README.md
@@ -14,46 +14,57 @@ PaddlePaddle provides a rich collection of computation units that enable users to solve problems in a modular way
 In the word embedding example, we show how to use Hierarchical Sigmoid and Noise Contrastive Estimation (NCE) to speed up the training of word embeddings.
 
 - 1.1 [Hsigmoid-accelerated word embedding training](https://github.com/PaddlePaddle/models/tree/develop/word_embedding)
+- 1.2 [NCE-accelerated word embedding training](https://github.com/PaddlePaddle/models/tree/develop/nce_cost)
 
-## 2. Click-through rate prediction
+
+## 2. Language model
+
+The language model is a fundamental building block of natural language processing. It is a probability distribution model: it can tell which of several word sequences is more likely, or predict the most probable next word given a few preceding words. Language models are used in many areas, such as automatic writing, question answering, machine translation, spell checking, speech recognition, and part-of-speech tagging.
+
+In the language model example, we take text generation as the task and provide RNN language models (LSTM and GRU) as well as an N-Gram language model. Following the usage notes in the documentation, users can quickly adapt the models to their own corpora and train entertaining models such as "automatic poetry" or "automatic prose" writers.
+
+- 2.1 [Text generation with LSTM, GRU and N-Gram models](https://github.com/PaddlePaddle/models/tree/develop/language_model)
+
+## 3. Click-through rate prediction
 
 A click-through rate (CTR) model estimates the probability that a user clicks on an ad, predicting the outcome of every ad impression; it is one of the core algorithms of advertising technology. Logistic regression handles large-scale sparse features well and dominated the early days of CTR prediction. In recent years, DNN models have gradually taken over the task thanks to their strong learning capacity.
 
 In the CTR example, we present the Wide & Deep model proposed by Google. It combines the strengths of a DNN, which is good at learning abstract features, and of logistic regression, which is good at large-scale sparse features; it can serve as a relatively mature model framework and has seen some industrial adoption.
 
-- 2.1 [Wide & Deep CTR model](https://github.com/PaddlePaddle/models/tree/develop/ctr)
+- 3.1 [Wide & Deep CTR model](https://github.com/PaddlePaddle/models/tree/develop/ctr)
 
-## 3. Text classification
+## 4. Text classification
 
 Text classification is one of the most fundamental tasks in natural language processing. Deep learning removes the need for complex feature engineering: raw text is used directly as input and classification accuracy is optimized in a data-driven way.
 
 In the text classification example, we take sentiment classification as the task and provide a DNN-based non-sequential text classification model as well as a CNN-based sequence model (for an LSTM-based model, see the [sentiment analysis](https://github.com/PaddlePaddle/book/blob/develop/06.understand_sentiment/README.cn.md) chapter of the PaddleBook).
 
-- 3.1 [Sentiment classification with DNN / CNN](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
+- 4.1 [Sentiment classification with DNN / CNN](https://github.com/PaddlePaddle/models/tree/develop/text_classification)
 
-## 4. Learning to rank
+## 5. Learning to rank
 
 Learning to rank (LTR) is one of the core problems of information retrieval and search engine research. A scoring function is learned with machine learning to score the candidates to be ranked, and the ranking is then determined by the scores. Deep neural networks can be used to model the scoring function, forming a variety of deep-learning-based LTR models.
 
 In the LTR example, we introduce a pairwise ranking model based on the RankLoss cost and a listwise ranking model based on the LambdaRank cost (for the pointwise strategy, see the [recommender system](https://github.com/PaddlePaddle/book/blob/develop/05.recommender_system/README.cn.md) chapter of the PaddleBook).
 
-- 4.1 [Pairwise and listwise learning to rank](https://github.com/PaddlePaddle/models/tree/develop/ltr)
+- 5.1 [Pairwise and listwise learning to rank](https://github.com/PaddlePaddle/models/tree/develop/ltr)
-## 5. Sequence tagging
+## 6. Sequence tagging
 
 Given an input sequence, a sequence tagging model assigns a class label to every element of the sequence; this is one of the most fundamental tasks in natural language processing. With the continued development of deep learning, using a recurrent neural network to learn feature representations of the input and a conditional random field (CRF) to perform the tagging on top of those features has gradually become the standard solution to sequence tagging problems.
 
 In the sequence tagging example, we take named entity recognition (NER) as the task and show how to train an end-to-end sequence tagging model.
 
-- 5.1 [Named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
+- 6.1 [Named entity recognition](https://github.com/PaddlePaddle/models/tree/develop/sequence_tagging_for_ner)
 
-## 6. Sequence-to-sequence learning
+## 7. Sequence-to-sequence learning
 
 Sequence-to-sequence learning maps between two (or even more) sequences of unbounded length. It has a wide range of applications, including machine translation, dialogue and question answering, ad-copy generation, autoencoding (e.g., encoding financial profiles), and judging the semantic relatedness of multiple text strings.
 
 In the sequence-to-sequence example, we take machine translation as the task and provide several improved models: a sequence-to-sequence model without attention, which is the foundation of all sequence-to-sequence models; scheduled sampling, which alleviates the error accumulation of RNN models on generation tasks; and neural machine translation with an external memory, which strengthens the network's memory capacity to handle complex sequence-to-sequence tasks.
 
-- 6.1 [Encoder-decoder model without attention](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
+- 7.1 [Encoder-decoder model without attention](https://github.com/PaddlePaddle/models/tree/develop/nmt_without_attention)
+
 ## Copyright and License
 PaddlePaddle is provided under the [Apache-2.0 license](LICENSE).
diff --git a/deep_speech_2/compute_mean_std.py b/deep_speech_2/compute_mean_std.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/__init__.py b/deep_speech_2/data_utils/__init__.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/audio.py b/deep_speech_2/data_utils/audio.py
old mode 100755
new mode 100644
index 916c8ac1ae..5d02feb60d
--- a/deep_speech_2/data_utils/audio.py
+++ b/deep_speech_2/data_utils/audio.py
@@ -6,6 +6,10 @@
 import numpy as np
 import io
 import soundfile
+import scikits.samplerate
+from scipy import signal
+import random
+import copy
 
 
 class AudioSegment(object):
@@ -75,6 +79,32 @@ def from_bytes(cls, bytes):
             io.BytesIO(bytes), dtype='float32')
         return cls(samples, sample_rate)
 
+    @classmethod
+    def concatenate(cls, *segments):
+        """Concatenate an arbitrary number of audio segments together.
+
+        :param *segments: Input audio segments to be concatenated.
+        :type *segments: tuple of AudioSegment
+        :return: Audio segment instance resulting from the concatenation.
+        :rtype: AudioSegment
+        :raises ValueError: If the number of segments is zero, or if the
+                            sample_rate of any segment does not match.
+        :raises TypeError: If any segment is not an AudioSegment instance.
+        """
+        # Perform basic sanity-checks.
+        if len(segments) == 0:
+            raise ValueError("No audio segments are given to concatenate.")
+        sample_rate = segments[0]._sample_rate
+        for seg in segments:
+            if sample_rate != seg._sample_rate:
+                raise ValueError("Can't concatenate segments with "
+                                 "different sample rates.")
+            if type(seg) is not cls:
+                raise TypeError("Only audio segments of the same type "
+                                "can be concatenated.")
+        samples = np.concatenate([seg.samples for seg in segments])
+        return cls(samples, sample_rate)
+
     def to_wav_file(self, filepath, dtype='float32'):
         """Save audio segment to disk as wav file.
@@ -100,6 +130,89 @@ def to_wav_file(self, filepath, dtype='float32'):
             format='WAV',
             subtype=subtype_map[dtype])
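To show how the new `concatenate` classmethod is meant to be used, here is a short sketch. The WAV paths are hypothetical, and it assumes the file-loading `AudioSegment.from_file` constructor defined earlier in `audio.py` (not shown in this hunk):

```python
# Usage sketch for AudioSegment.concatenate (hypothetical paths; from_file
# is assumed to be the file-loading constructor defined earlier in audio.py).
from data_utils.audio import AudioSegment

seg1 = AudioSegment.from_file("utt_0001.wav")
seg2 = AudioSegment.from_file("utt_0002.wav")

# All inputs must share one sample rate, otherwise ValueError is raised.
joined = AudioSegment.concatenate(seg1, seg2)
joined.to_wav_file("joined.wav")
```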
+    @classmethod
+    def slice_from_file(cls, file, start=None, end=None):
+        """Loads a small section of an audio file without having to load
+        the entire file into memory, which can be incredibly wasteful.
+
+        :param file: Input audio filepath or file object.
+        :type file: basestring|file
+        :param start: Start time in seconds. If start is negative, it wraps
+                      around from the end. If not provided, this function
+                      reads from the very beginning.
+        :type start: float
+        :param end: End time in seconds. If end is negative, it wraps around
+                    from the end. If not provided, the default behavior is
+                    to read to the end of the file.
+        :type end: float
+        :return: AudioSegment instance of the specified slice of the input
+                 audio file.
+        :rtype: AudioSegment
+        :raise ValueError: If start or end is incorrectly set, e.g. out of
+                           bounds in time.
+        """
+        sndfile = soundfile.SoundFile(file)
+        sample_rate = sndfile.samplerate
+        duration = float(len(sndfile)) / sample_rate
+        start = 0. if start is None else start
+        # Default to the end of the file, as documented above.
+        end = duration if end is None else end
+        if start < 0.0:
+            start += duration
+        if end < 0.0:
+            end += duration
+        if start < 0.0:
+            raise ValueError("The slice start position (%f s) is out of "
+                             "bounds." % start)
+        if end < 0.0:
+            raise ValueError("The slice end position (%f s) is out of "
+                             "bounds." % end)
+        if start > end:
+            raise ValueError("The slice start position (%f s) is later than "
+                             "the slice end position (%f s)." % (start, end))
+        if end > duration:
+            raise ValueError("The slice end position (%f s) is out of bounds "
+                             "(> %f s)." % (end, duration))
+        start_frame = int(start * sample_rate)
+        end_frame = int(end * sample_rate)
+        sndfile.seek(start_frame)
+        data = sndfile.read(frames=end_frame - start_frame, dtype='float32')
+        return cls(data, sample_rate)
+
+    @classmethod
+    def make_silence(cls, duration, sample_rate):
+        """Creates a silent audio segment of the given duration and
+        sample rate.
+
+        :param duration: Length of silence in seconds.
+        :type duration: float
+        :param sample_rate: Sample rate.
+        :type sample_rate: float
+        :return: Silent AudioSegment instance of the given duration.
+        :rtype: AudioSegment
+        """
+        samples = np.zeros(int(duration * sample_rate))
+        return cls(samples, sample_rate)
+
+    def superimpose(self, other):
+        """Add samples from another segment to those of this segment
+        (sample-wise addition, not segment concatenation).
+
+        Note that this is an in-place transformation.
+
+        :param other: Segment containing samples to be added in.
+        :type other: AudioSegment
+        :raise TypeError: If the types of the two segments don't match.
+        :raise ValueError: If the sample rates of the two segments are not
+                           equal, or if the lengths of the segments don't
+                           match.
+        """
+        if type(self) != type(other):
+            raise TypeError("Cannot add segments of different types: %s "
+                            "and %s." % (type(self), type(other)))
+        if self._sample_rate != other._sample_rate:
+            raise ValueError("Sample rates must match to add segments.")
+        if len(self._samples) != len(other._samples):
+            raise ValueError("Segment lengths must match to add segments.")
+        self._samples += other._samples
+
     def to_bytes(self, dtype='float32'):
         """Create a byte string containing the audio content.
@@ -143,23 +256,257 @@ def change_speed(self, speed_rate):
         new_indices = np.linspace(start=0, stop=old_length, num=new_length)
         self._samples = np.interp(new_indices, old_indices, self._samples)
 
-    def normalize(self, target_sample_rate):
-        raise NotImplementedError()
+    def normalize(self, target_db=-20, max_gain_db=300.0):
+        """Normalize audio to be of the desired RMS value in decibels.
+
+        Note that this is an in-place transformation.
+
+        :param target_db: Target RMS value in decibels. This value should be
+                          less than 0.0 as 0.0 is full-scale audio.
+        :type target_db: float
+        :param max_gain_db: Max amount of gain in dB that can be applied for
+                            normalization. This is to prevent nans when
+                            attempting to normalize a signal consisting of
+                            all zeros.
+        :type max_gain_db: float
+        :raises ValueError: If the required gain to normalize the segment to
+                            the target_db value exceeds max_gain_db.
+        """
+        gain = target_db - self.rms_db
+        if gain > max_gain_db:
+            raise ValueError(
+                "Unable to normalize segment to %f dB because the required "
+                "gain exceeds max_gain_db (%f dB)." %
+                (target_db, max_gain_db))
+        self.apply_gain(min(max_gain_db, gain))
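The `normalize` method above rests on a simple relation between RMS power in decibels and an amplitude gain. A small standalone check of that arithmetic (plain numpy, not code from this patch):

```python
# Standalone sketch of the dB arithmetic behind normalize() (not patch code).
import numpy as np

samples = 0.1 * np.ones(16000, dtype='float32')  # constant signal, rms_db = -20
rms_db = 10 * np.log10(np.mean(samples ** 2))    # RMS power in decibels
gain = -6.0 - rms_db                             # gain needed to reach -6 dB
normalized = samples * 10. ** (gain / 20.)       # what apply_gain effectively does
print(10 * np.log10(np.mean(normalized ** 2)))   # ~ -6.0
```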
+ """ + gain = target_db - self.rms_db + if gain > max_gain_db: + raise ValueError( + "Unable to normalize segment to %f dB because the " + "the probable gain have exceeds max_gain_db (%f dB)" % + (target_db, max_gain_db)) + self.apply_gain(min(max_gain_db, target_db - self.rms_db)) + + def normalize_online_bayesian(self, + target_db, + prior_db, + prior_samples, + startup_delay=0.0): + """Normalize audio using a production-compatible online/causal + algorithm. This uses an exponential likelihood and gamma prior to + make online estimates of the RMS even when there are very few samples. + + Note that this is an in-place transformation. + + :param target_db: Target RMS value in decibels. + :type target_bd: float + :param prior_db: Prior RMS estimate in decibels. + :type prior_db: float + :param prior_samples: Prior strength in number of samples. + :type prior_samples: float + :param startup_delay: Default 0.0s. If provided, this function will + accrue statistics for the first startup_delay + seconds before applying online normalization. + :type startup_delay: float + """ + # Estimate total RMS online. + startup_sample_idx = min(self.num_samples - 1, + int(self.sample_rate * startup_delay)) + prior_mean_squared = 10.**(prior_db / 10.) + prior_sum_of_squares = prior_mean_squared * prior_samples + cumsum_of_squares = np.cumsum(self.samples**2) + sample_count = np.arange(len(self.num_samples)) + 1 + if startup_sample_idx > 0: + cumsum_of_squares[:startup_sample_idx] = \ + cumsum_of_squares[startup_sample_idx] + sample_count[:startup_sample_idx] = \ + sample_count[startup_sample_idx] + mean_squared_estimate = ((cumsum_of_squares + prior_sum_of_squares) / + (sample_count + prior_samples)) + rms_estimate_db = 10 * np.log10(mean_squared_estimate) + # Compute required time-varying gain. + gain_db = target_db - rms_estimate_db + self.apply_gain(gain_db) + + def resample(self, target_sample_rate, quality='sinc_medium'): + """Resample the audio to a target sample rate. + + Note that this is an in-place transformation. - def resample(self, target_sample_rate): - raise NotImplementedError() + :param target_sample_rate: Target sample rate. + :type target_sample_rate: int + :param quality: One of {'sinc_fastest', 'sinc_medium', 'sinc_best'}. + Sets resampling speed/quality tradeoff. + See http://www.mega-nerd.com/SRC/api_misc.html#Converters + :type quality: str + """ + resample_ratio = target_sample_rate / self._sample_rate + self._samples = scikits.samplerate.resample( + self._samples, r=resample_ratio, type=quality) + self._sample_rate = target_sample_rate def pad_silence(self, duration, sides='both'): - raise NotImplementedError() + """Pad this audio sample with a period of silence. + + Note that this is an in-place transformation. + + :param duration: Length of silence in seconds to pad. + :type duration: float + :param sides: Position for padding: + 'beginning' - adds silence in the beginning; + 'end' - adds silence in the end; + 'both' - adds silence in both the beginning and the end. + :type sides: str + :raises ValueError: If sides is not supported. 
+ """ + if duration == 0.0: + return self + cls = type(self) + silence = self.make_silence(duration, self._sample_rate) + if sides == "beginning": + padded = cls.concatenate(silence, self) + elif sides == "end": + padded = cls.concatenate(self, silence) + elif sides == "both": + padded = cls.concatenate(silence, self, silence) + else: + raise ValueError("Unknown value for the sides %s" % sides) + self._samples = padded._samples def subsegment(self, start_sec=None, end_sec=None): - raise NotImplementedError() + """Cut the AudioSegment between given boundaries. - def convolve(self, filter, allow_resample=False): - raise NotImplementedError() + Note that this is an in-place transformation. + + :param start_sec: Beginning of subsegment in seconds. + :type start_sec: float + :param end_sec: End of subsegment in seconds. + :type end_sec: float + :raise ValueError: If start_sec or end_sec is incorrectly set, e.g. out + of bounds in time. + """ + start_sec = 0.0 if start_sec is None else start_sec + end_sec = self.duration if end_sec is None else end_sec + if start_sec < 0.0: + start_sec = self.duration + start_sec + if end_sec < 0.0: + end_sec = self.duration + end_sec + if start_sec < 0.0: + raise ValueError("The slice start position (%f s) is out of " + "bounds." % start_sec) + if end_sec < 0.0: + raise ValueError("The slice end position (%f s) is out of bounds." % + end_sec) + if start_sec > end_sec: + raise ValueError("The slice start position (%f s) is later than " + "the end position (%f s)." % (start_sec, end_sec)) + if end_sec > self.duration: + raise ValueError("The slice end position (%f s) is out of bounds " + "(> %f s)" % (end_sec, self.duration)) + start_sample = int(round(start_sec * self._sample_rate)) + end_sample = int(round(end_sec * self._sample_rate)) + self._samples = self._samples[start_sample:end_sample] + + def random_subsegment(self, subsegment_length, rng=None): + """Cut the specified length of the audiosegment randomly. + + Note that this is an in-place transformation. + + :param subsegment_length: Subsegment length in seconds. + :type subsegment_length: float + :param rng: Random number generator state. + :type rng: random.Random + :raises ValueError: If the length of subsegment is greater than + the origineal segemnt. + """ + rng = random.Random() if rng is None else rng + if subsegment_length > self.duration: + raise ValueError("Length of subsegment must not be greater " + "than original segment.") + start_time = rng.uniform(0.0, self.duration - subsegment_length) + self.subsegment(start_time, start_time + subsegment_length) - def convolve_and_normalize(self, filter, allow_resample=False): - raise NotImplementedError() + def convolve(self, impulse_segment, allow_resample=False): + """Convolve this audio segment with the given impulse segment. + + Note that this is an in-place transformation. + + :param impulse_segment: Impulse response segments. + :type impulse_segment: AudioSegment + :param allow_resample: Indicates whether resampling is allowed when + the impulse_segment has a different sample + rate from this signal. + :type allow_resample: bool + :raises ValueError: If the sample rate is not match between two + audio segments when resample is not allowed. + """ + if allow_resample and self.sample_rate != impulse_segment.sample_rate: + impulse_segment = impulse_segment.resample(self.sample_rate) + if self.sample_rate != impulse_segment.sample_rate: + raise ValueError("Impulse segment's sample rate (%d Hz) is not" + "equal to base signal sample rate (%d Hz)." 
+                             % (impulse_segment.sample_rate,
+                                self.sample_rate))
+        samples = signal.fftconvolve(self.samples, impulse_segment.samples,
+                                     "full")
+        self._samples = samples
+
+    def convolve_and_normalize(self, impulse_segment, allow_resample=False):
+        """Convolve and normalize the resulting audio segment so that it
+        has the same average power as the input signal.
+
+        Note that this is an in-place transformation.
+
+        :param impulse_segment: Impulse response segments.
+        :type impulse_segment: AudioSegment
+        :param allow_resample: Indicates whether resampling is allowed when
+                               the impulse_segment has a different sample
+                               rate from this signal.
+        :type allow_resample: bool
+        """
+        target_db = self.rms_db
+        self.convolve(impulse_segment, allow_resample=allow_resample)
+        self.normalize(target_db)
+
+    def add_noise(self,
+                  noise,
+                  snr_dB,
+                  allow_downsampling=False,
+                  max_gain_db=300.0,
+                  rng=None):
+        """Add the given noise segment at a specific signal-to-noise ratio.
+        If the noise segment is longer than this segment, a random subsegment
+        of matching length is sampled from it and used instead.
+
+        Note that this is an in-place transformation.
+
+        :param noise: Noise signal to add.
+        :type noise: AudioSegment
+        :param snr_dB: Signal-to-Noise Ratio, in decibels.
+        :type snr_dB: float
+        :param allow_downsampling: Whether to allow the noise signal to be
+                                   downsampled to match the base signal sample
+                                   rate.
+        :type allow_downsampling: bool
+        :param max_gain_db: Maximum amount of gain to apply to noise signal
+                            before adding it in. This is to prevent attempting
+                            to apply infinite gain to a zero signal.
+        :type max_gain_db: float
+        :param rng: Random number generator state.
+        :type rng: None|random.Random
+        :raises ValueError: If the sample rates of the two segments do not
+                            match when downsampling is not allowed, or if the
+                            noise segment is shorter than the original
+                            segment.
+        """
+        rng = random.Random() if rng is None else rng
+        # Work on a copy so the caller's noise segment is left untouched.
+        noise_new = copy.deepcopy(noise)
+        if allow_downsampling and noise_new.sample_rate > self.sample_rate:
+            noise_new.resample(self.sample_rate)
+        if noise_new.sample_rate != self.sample_rate:
+            raise ValueError("Noise sample rate (%d Hz) is not equal to base "
+                             "signal sample rate (%d Hz)." %
+                             (noise_new.sample_rate, self.sample_rate))
+        if noise_new.duration < self.duration:
+            raise ValueError("Noise signal (%f sec) must be at least as long "
+                             "as base signal (%f sec)." %
+                             (noise_new.duration, self.duration))
+        noise_gain_db = min(self.rms_db - noise_new.rms_db - snr_dB,
+                            max_gain_db)
+        noise_new.random_subsegment(self.duration, rng=rng)
+        noise_new.apply_gain(noise_gain_db)
+        self.superimpose(noise_new)
 
     @property
     def samples(self):
@@ -186,7 +533,7 @@ def num_samples(self):
         :return: Number of samples.
         :rtype: int
         """
-        return self._samples.shape(0)
+        return self._samples.shape[0]
 
     @property
     def duration(self):
@@ -230,7 +577,7 @@ def _convert_samples_from_float32(self, samples, dtype):
         Audio sample type is usually integer or float-point. For integer
         type, float32 will be rescaled from [-1, 1] to the maximum range
         supported by the integer type.
-
+        This is for writing an audio file.
""" dtype = np.dtype(dtype) diff --git a/deep_speech_2/data_utils/augmentor/__init__.py b/deep_speech_2/data_utils/augmentor/__init__.py old mode 100755 new mode 100644 diff --git a/deep_speech_2/data_utils/augmentor/augmentation.py b/deep_speech_2/data_utils/augmentor/augmentation.py old mode 100755 new mode 100644 diff --git a/deep_speech_2/data_utils/augmentor/base.py b/deep_speech_2/data_utils/augmentor/base.py old mode 100755 new mode 100644 diff --git a/deep_speech_2/data_utils/augmentor/volume_perturb.py b/deep_speech_2/data_utils/augmentor/volume_perturb.py old mode 100755 new mode 100644 diff --git a/deep_speech_2/data_utils/data.py b/deep_speech_2/data_utils/data.py index 48e03fe85d..424343a48f 100644 --- a/deep_speech_2/data_utils/data.py +++ b/deep_speech_2/data_utils/data.py @@ -80,7 +80,7 @@ def batch_reader_creator(self, padding_to=-1, flatten=False, sortagrad=False, - batch_shuffle=False): + shuffle_method="batch_shuffle"): """ Batch data reader creator for audio data. Return a callable generator function to produce batches of data. @@ -104,12 +104,22 @@ def batch_reader_creator(self, :param sortagrad: If set True, sort the instances by audio duration in the first epoch for speed up training. :type sortagrad: bool - :param batch_shuffle: If set True, instances are batch-wise shuffled. - For more details, please see - ``_batch_shuffle.__doc__``. - If sortagrad is True, batch_shuffle is disabled + :param shuffle_method: Shuffle method. Options: + '' or None: no shuffle. + 'instance_shuffle': instance-wise shuffle. + 'batch_shuffle': similarly-sized instances are + put into batches, and then + batch-wise shuffle the batches. + For more details, please see + ``_batch_shuffle.__doc__``. + 'batch_shuffle_clipped': 'batch_shuffle' with + head shift and tail + clipping. For more + details, please see + ``_batch_shuffle``. + If sortagrad is True, shuffle is disabled for the first epoch. - :type batch_shuffle: bool + :type shuffle_method: None|str :return: Batch reader function, producing batches of data when called. :rtype: callable """ @@ -123,8 +133,20 @@ def batch_reader(): # sort (by duration) or batch-wise shuffle the manifest if self._epoch == 0 and sortagrad: manifest.sort(key=lambda x: x["duration"]) - elif batch_shuffle: - manifest = self._batch_shuffle(manifest, batch_size) + else: + if shuffle_method == "batch_shuffle": + manifest = self._batch_shuffle( + manifest, batch_size, clipped=False) + elif shuffle_method == "batch_shuffle_clipped": + manifest = self._batch_shuffle( + manifest, batch_size, clipped=True) + elif shuffle_method == "instance_shuffle": + self._rng.shuffle(manifest) + elif not shuffle_method: + pass + else: + raise ValueError("Unknown shuffle method %s." % + shuffle_method) # prepare batches instance_reader = self._instance_reader_creator(manifest) batch = [] @@ -218,7 +240,7 @@ def _padding_batch(self, batch, padding_to=-1, flatten=False): new_batch.append((padded_audio, text)) return new_batch - def _batch_shuffle(self, manifest, batch_size): + def _batch_shuffle(self, manifest, batch_size, clipped=False): """Put similarly-sized instances into minibatches for better efficiency and make a batch-wise shuffle. @@ -233,6 +255,9 @@ def _batch_shuffle(self, manifest, batch_size): :param batch_size: Batch size. This size is also used for generate a random number for batch shuffle. :type batch_size: int + :param clipped: Whether to clip the heading (small shift) and trailing + (incomplete batch) instances. 
+    :type clipped: bool
     :return: Batch shuffled manifest.
     :rtype: list
     """
@@ -241,7 +266,8 @@
         batch_manifest = zip(*[iter(manifest[shift_len:])] * batch_size)
         self._rng.shuffle(batch_manifest)
         batch_manifest = list(sum(batch_manifest, ()))
-        res_len = len(manifest) - shift_len - len(batch_manifest)
-        batch_manifest.extend(manifest[-res_len:])
-        batch_manifest.extend(manifest[0:shift_len])
+        if not clipped:
+            res_len = len(manifest) - shift_len - len(batch_manifest)
+            # Use an explicit index so that res_len == 0 appends nothing
+            # (manifest[-0:] would wrongly copy the whole list).
+            batch_manifest.extend(manifest[len(manifest) - res_len:])
+            batch_manifest.extend(manifest[0:shift_len])
         return batch_manifest
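For readers who want to see the batch-shuffle strategy in isolation, here is a simplified, self-contained sketch of the same idea (plain Python, not the patch code):

```python
# Simplified, standalone sketch of the batch-shuffle strategy (not patch code).
import random

def batch_shuffle(manifest, batch_size, clipped=False, rng=random):
    shift_len = rng.randint(0, batch_size - 1)         # random head shift
    batches = list(zip(*[iter(manifest[shift_len:])] * batch_size))
    rng.shuffle(batches)                               # shuffle whole batches
    shuffled = [item for batch in batches for item in batch]
    if not clipped:                                    # re-append head and tail
        res_len = len(manifest) - shift_len - len(shuffled)
        shuffled.extend(manifest[len(manifest) - res_len:])
        shuffled.extend(manifest[0:shift_len])
    return shuffled

print(batch_shuffle(list(range(10)), batch_size=3))
```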
diff --git a/deep_speech_2/data_utils/featurizer/__init__.py b/deep_speech_2/data_utils/featurizer/__init__.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/featurizer/audio_featurizer.py b/deep_speech_2/data_utils/featurizer/audio_featurizer.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/featurizer/speech_featurizer.py b/deep_speech_2/data_utils/featurizer/speech_featurizer.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/featurizer/text_featurizer.py b/deep_speech_2/data_utils/featurizer/text_featurizer.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/normalizer.py b/deep_speech_2/data_utils/normalizer.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/data_utils/speech.py b/deep_speech_2/data_utils/speech.py
old mode 100755
new mode 100644
index 48db595b41..fc031ff46f
--- a/deep_speech_2/data_utils/speech.py
+++ b/deep_speech_2/data_utils/speech.py
@@ -65,6 +65,74 @@ def from_bytes(cls, bytes, transcript):
         audio = AudioSegment.from_bytes(bytes)
         return cls(audio.samples, audio.sample_rate, transcript)
 
+    @classmethod
+    def concatenate(cls, *segments):
+        """Concatenate an arbitrary number of speech segments together; both
+        audio and transcripts will be concatenated.
+
+        :param *segments: Input speech segments to be concatenated.
+        :type *segments: tuple of SpeechSegment
+        :return: Speech segment instance.
+        :rtype: SpeechSegment
+        :raises ValueError: If the number of segments is zero, or if the
+                            sample_rate of any two segments does not match.
+        :raises TypeError: If any segment is not a SpeechSegment instance.
+        """
+        if len(segments) == 0:
+            raise ValueError("No speech segments are given to concatenate.")
+        sample_rate = segments[0]._sample_rate
+        transcripts = ""
+        for seg in segments:
+            if sample_rate != seg._sample_rate:
+                raise ValueError("Can't concatenate segments with "
+                                 "different sample rates.")
+            if type(seg) is not cls:
+                raise TypeError("Only speech segments of the same type "
+                                "can be concatenated.")
+            transcripts += seg._transcript
+        samples = np.concatenate([seg.samples for seg in segments])
+        return cls(samples, sample_rate, transcripts)
+
+    @classmethod
+    def slice_from_file(cls, filepath, start=None, end=None, transcript=""):
+        """Loads a small section of a speech file without having to load the
+        entire file into memory, which can be incredibly wasteful.
+
+        :param filepath: Filepath or file object to audio file.
+        :type filepath: basestring|file
+        :param start: Start time in seconds. If start is negative, it wraps
+                      around from the end. If not provided, this function
+                      reads from the very beginning.
+        :type start: float
+        :param end: End time in seconds. If end is negative, it wraps around
+                    from the end. If not provided, the default behavior is
+                    to read to the end of the file.
+        :type end: float
+        :param transcript: Transcript text for the speech. If not provided,
+                           the default is an empty string.
+        :type transcript: basestring
+        :return: SpeechSegment instance of the specified slice of the input
+                 speech file.
+        :rtype: SpeechSegment
+        """
+        audio = AudioSegment.slice_from_file(filepath, start, end)
+        return cls(audio.samples, audio.sample_rate, transcript)
+
+    @classmethod
+    def make_silence(cls, duration, sample_rate):
+        """Creates a silent speech segment of the given duration and sample
+        rate; the transcript will be an empty string.
+
+        :param duration: Length of silence in seconds.
+        :type duration: float
+        :param sample_rate: Sample rate.
+        :type sample_rate: float
+        :return: Silence of the given duration.
+        :rtype: SpeechSegment
+        """
+        audio = AudioSegment.make_silence(duration, sample_rate)
+        return cls(audio.samples, audio.sample_rate, "")
+
     @property
     def transcript(self):
         """Return the transcript text.
diff --git a/deep_speech_2/data_utils/utils.py b/deep_speech_2/data_utils/utils.py
old mode 100755
new mode 100644
diff --git a/deep_speech_2/datasets/librispeech/librispeech.py b/deep_speech_2/datasets/librispeech/librispeech.py
index faf038cc19..87e52ae4aa 100644
--- a/deep_speech_2/datasets/librispeech/librispeech.py
+++ b/deep_speech_2/datasets/librispeech/librispeech.py
@@ -37,8 +37,7 @@
 MD5_TRAIN_CLEAN_360 = "c0e676e450a7ff2f54aeade5171606fa"
 MD5_TRAIN_OTHER_500 = "d1a0fd59409feb2c614ce4d30c387708"
 
-parser = argparse.ArgumentParser(
-    description='Downloads and prepare LibriSpeech dataset.')
+parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
     "--target_dir",
     default=DATA_HOME + "/Libri",
diff --git a/deep_speech_2/datasets/run_all.sh b/deep_speech_2/datasets/run_all.sh
old mode 100755
new mode 100644
diff --git a/deep_speech_2/decoder.py b/deep_speech_2/decoder.py
old mode 100755
new mode 100644
index 8314885ce6..77d950b8db
--- a/deep_speech_2/decoder.py
+++ b/deep_speech_2/decoder.py
@@ -8,8 +8,7 @@
 
 def ctc_best_path_decode(probs_seq, vocabulary):
-    """
-    Best path decoding, also called argmax decoding or greedy decoding.
+    """Best path decoding, also called argmax decoding or greedy decoding.
     The path consisting of the most probable tokens is further post-processed
     to remove consecutive repetitions and all blanks.
 
@@ -38,8 +37,7 @@ def ctc_best_path_decode(probs_seq, vocabulary):
 
 def ctc_decode(probs_seq, vocabulary, method):
-    """
-    CTC-like sequence decoding from a sequence of likelihood probabilities.
+    """CTC-like sequence decoding from a sequence of likelihood probabilities.
 
     :param probs_seq: 2-D list of probabilities over the vocabulary for each
                       character. Each element is a list of float probabilities
diff --git a/deep_speech_2/error_rate.py b/deep_speech_2/error_rate.py
new file mode 100644
index 0000000000..0cf17921c0
--- /dev/null
+++ b/deep_speech_2/error_rate.py
@@ -0,0 +1,141 @@
+# -*- coding: utf-8 -*-
+"""This module provides functions to calculate error rates at different
+levels, e.g. WER for the word level, CER for the character level.
+"""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import numpy as np
+
+
+def _levenshtein_distance(ref, hyp):
+    """Levenshtein distance is a string metric for measuring the difference
+    between two sequences. Informally, the Levenshtein distance is defined as
+    the minimum number of single-character edits (substitutions, insertions
+    or deletions) required to change one word into the other. We can
+    naturally extend the edits to the word level when calculating the
+    Levenshtein distance between two sentences.
+    """
+    ref_len = len(ref)
+    hyp_len = len(hyp)
+
+    # special cases
+    if ref == hyp:
+        return 0
+    if ref_len == 0:
+        return hyp_len
+    if hyp_len == 0:
+        return ref_len
+
+    distance = np.zeros((ref_len + 1, hyp_len + 1), dtype=np.int32)
+
+    # initialize the distance matrix
+    for j in xrange(hyp_len + 1):
+        distance[0][j] = j
+    for i in xrange(ref_len + 1):
+        distance[i][0] = i
+
+    # calculate the Levenshtein distance
+    for i in xrange(1, ref_len + 1):
+        for j in xrange(1, hyp_len + 1):
+            if ref[i - 1] == hyp[j - 1]:
+                distance[i][j] = distance[i - 1][j - 1]
+            else:
+                s_num = distance[i - 1][j - 1] + 1
+                i_num = distance[i][j - 1] + 1
+                d_num = distance[i - 1][j] + 1
+                distance[i][j] = min(s_num, i_num, d_num)
+
+    return distance[ref_len][hyp_len]
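A quick sanity check of the dynamic program above, using the classic textbook pair (`_levenshtein_distance` is a private helper, so the import is for illustration only):

```python
# Quick check of the edit-distance dynamic program (illustration only).
from error_rate import _levenshtein_distance

# kitten -> sitten -> sittin -> sitting: two substitutions and one insertion.
print(_levenshtein_distance('kitten', 'sitting'))  # 3
```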
+
+
+def wer(reference, hypothesis, ignore_case=False, delimiter=' '):
+    """Calculate word error rate (WER). WER compares the reference text and
+    the hypothesis text at the word level. WER is defined as:
+
+    .. math::
+        WER = (Sw + Dw + Iw) / Nw
+
+    where
+
+    .. code-block:: text
+
+        Sw is the number of words substituted,
+        Dw is the number of words deleted,
+        Iw is the number of words inserted,
+        Nw is the number of words in the reference
+
+    We can use the Levenshtein distance to calculate WER. Note that empty
+    items will be removed when splitting sentences by the delimiter.
+
+    :param reference: The reference sentence.
+    :type reference: basestring
+    :param hypothesis: The hypothesis sentence.
+    :type hypothesis: basestring
+    :param ignore_case: Whether to ignore case during the comparison.
+    :type ignore_case: bool
+    :param delimiter: Delimiter of input sentences.
+    :type delimiter: char
+    :return: Word error rate.
+    :rtype: float
+    :raises ValueError: If the reference length is zero.
+    """
+    if ignore_case:
+        reference = reference.lower()
+        hypothesis = hypothesis.lower()
+
+    ref_words = filter(None, reference.split(delimiter))
+    hyp_words = filter(None, hypothesis.split(delimiter))
+
+    if len(ref_words) == 0:
+        raise ValueError("Reference's word number should be greater than 0.")
+
+    edit_distance = _levenshtein_distance(ref_words, hyp_words)
+    wer = float(edit_distance) / len(ref_words)
+    return wer
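A worked example of `wer`: one substituted word out of three reference words gives a rate of 1/3.

```python
# Worked example for wer(): one substitution among three reference words.
from error_rate import wer

print(wer('the cat sat', 'the cat sit'))                    # 0.333...
print(wer('The Cat Sat', 'the cat sat', ignore_case=True))  # 0.0
```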
+ """ + if ignore_case == True: + reference = reference.lower() + hypothesis = hypothesis.lower() + + reference = ' '.join(filter(None, reference.split(' '))) + hypothesis = ' '.join(filter(None, hypothesis.split(' '))) + + if len(reference) == 0: + raise ValueError("Length of reference should be greater than 0.") + + edit_distance = _levenshtein_distance(reference, hypothesis) + cer = float(edit_distance) / len(reference) + return cer diff --git a/deep_speech_2/infer.py b/deep_speech_2/infer.py index f7c99df117..06449ab05c 100644 --- a/deep_speech_2/infer.py +++ b/deep_speech_2/infer.py @@ -10,9 +10,9 @@ from data_utils.data import DataGenerator from model import deep_speech2 from decoder import ctc_decode +import utils -parser = argparse.ArgumentParser( - description='Simplified version of DeepSpeech2 inference.') +parser = argparse.ArgumentParser(description=__doc__) parser.add_argument( "--num_samples", default=10, @@ -62,9 +62,7 @@ def infer(): - """ - Max-ctc-decoding for DeepSpeech2. - """ + """Max-ctc-decoding for DeepSpeech2.""" # initialize data generator data_generator = DataGenerator( vocab_filepath=args.vocab_filepath, @@ -98,7 +96,7 @@ def infer(): manifest_path=args.decode_manifest_path, batch_size=args.num_samples, sortagrad=False, - batch_shuffle=False) + shuffle_method=None) infer_data = batch_reader().next() # run inference @@ -123,6 +121,7 @@ def infer(): def main(): + utils.print_arguments(args) paddle.init(use_gpu=args.use_gpu, trainer_count=1) infer() diff --git a/deep_speech_2/requirements.txt b/deep_speech_2/requirements.txt index 58a93debe4..c37e88ffe7 100644 --- a/deep_speech_2/requirements.txt +++ b/deep_speech_2/requirements.txt @@ -1,2 +1,4 @@ SoundFile==0.9.0.post1 wget==3.2 +scikits.samplerate==0.3.3 +scipy==0.13.0b1 diff --git a/deep_speech_2/tests/test_error_rate.py b/deep_speech_2/tests/test_error_rate.py new file mode 100644 index 0000000000..be7313f357 --- /dev/null +++ b/deep_speech_2/tests/test_error_rate.py @@ -0,0 +1,59 @@ +# -*- coding: utf-8 -*- +"""Test error rate.""" +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import unittest +import error_rate + + +class TestParse(unittest.TestCase): + def test_wer_1(self): + ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' + hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night' + word_error_rate = error_rate.wer(ref, hyp) + self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6) + + def test_wer_2(self): + ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night' + word_error_rate = error_rate.wer(ref, ref) + self.assertEqual(word_error_rate, 0.0) + + def test_wer_3(self): + ref = ' ' + hyp = 'Hypothesis sentence' + with self.assertRaises(ValueError): + word_error_rate = error_rate.wer(ref, hyp) + + def test_cer_1(self): + ref = 'werewolf' + hyp = 'weae wolf' + char_error_rate = error_rate.cer(ref, hyp) + self.assertTrue(abs(char_error_rate - 0.25) < 1e-6) + + def test_cer_2(self): + ref = 'werewolf' + char_error_rate = error_rate.cer(ref, ref) + self.assertEqual(char_error_rate, 0.0) + + def test_cer_3(self): + ref = u'我是中国人' + hyp = u'我是 美洲人' + char_error_rate = error_rate.cer(ref, hyp) + self.assertTrue(abs(char_error_rate - 0.6) < 1e-6) + + def test_cer_4(self): + ref = u'我是中国人' + char_error_rate = error_rate.cer(ref, ref) + self.assertFalse(char_error_rate, 0.0) + + def test_cer_5(self): + ref = '' + hyp = 'Hypothesis' + with self.assertRaises(ValueError): + 
diff --git a/deep_speech_2/infer.py b/deep_speech_2/infer.py
index f7c99df117..06449ab05c 100644
--- a/deep_speech_2/infer.py
+++ b/deep_speech_2/infer.py
@@ -10,9 +10,9 @@
 from data_utils.data import DataGenerator
 from model import deep_speech2
 from decoder import ctc_decode
+import utils
 
-parser = argparse.ArgumentParser(
-    description='Simplified version of DeepSpeech2 inference.')
+parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
     "--num_samples",
     default=10,
@@ -62,9 +62,7 @@
 
 def infer():
-    """
-    Max-ctc-decoding for DeepSpeech2.
-    """
+    """Max-ctc-decoding for DeepSpeech2."""
     # initialize data generator
     data_generator = DataGenerator(
         vocab_filepath=args.vocab_filepath,
@@ -98,7 +96,7 @@ def infer():
         manifest_path=args.decode_manifest_path,
         batch_size=args.num_samples,
         sortagrad=False,
-        batch_shuffle=False)
+        shuffle_method=None)
     infer_data = batch_reader().next()
 
     # run inference
@@ -123,6 +121,7 @@ def infer():
 
 def main():
+    utils.print_arguments(args)
     paddle.init(use_gpu=args.use_gpu, trainer_count=1)
     infer()
diff --git a/deep_speech_2/requirements.txt b/deep_speech_2/requirements.txt
index 58a93debe4..c37e88ffe7 100644
--- a/deep_speech_2/requirements.txt
+++ b/deep_speech_2/requirements.txt
@@ -1,2 +1,4 @@
 SoundFile==0.9.0.post1
 wget==3.2
+scikits.samplerate==0.3.3
+scipy==0.13.0b1
diff --git a/deep_speech_2/tests/test_error_rate.py b/deep_speech_2/tests/test_error_rate.py
new file mode 100644
index 0000000000..be7313f357
--- /dev/null
+++ b/deep_speech_2/tests/test_error_rate.py
@@ -0,0 +1,59 @@
+# -*- coding: utf-8 -*-
+"""Test error rate."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import unittest
+import error_rate
+
+
+class TestParse(unittest.TestCase):
+    def test_wer_1(self):
+        ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
+        hyp = 'i GOT IT TO the FULLEST i LOVE TO portable FROM OF STORES last night'
+        word_error_rate = error_rate.wer(ref, hyp)
+        self.assertTrue(abs(word_error_rate - 0.769230769231) < 1e-6)
+
+    def test_wer_2(self):
+        ref = 'i UM the PHONE IS i LEFT THE portable PHONE UPSTAIRS last night'
+        word_error_rate = error_rate.wer(ref, ref)
+        self.assertEqual(word_error_rate, 0.0)
+
+    def test_wer_3(self):
+        ref = ' '
+        hyp = 'Hypothesis sentence'
+        with self.assertRaises(ValueError):
+            word_error_rate = error_rate.wer(ref, hyp)
+
+    def test_cer_1(self):
+        ref = 'werewolf'
+        hyp = 'weae wolf'
+        char_error_rate = error_rate.cer(ref, hyp)
+        self.assertTrue(abs(char_error_rate - 0.25) < 1e-6)
+
+    def test_cer_2(self):
+        ref = 'werewolf'
+        char_error_rate = error_rate.cer(ref, ref)
+        self.assertEqual(char_error_rate, 0.0)
+
+    def test_cer_3(self):
+        ref = u'我是中国人'
+        hyp = u'我是 美洲人'
+        char_error_rate = error_rate.cer(ref, hyp)
+        self.assertTrue(abs(char_error_rate - 0.6) < 1e-6)
+
+    def test_cer_4(self):
+        ref = u'我是中国人'
+        char_error_rate = error_rate.cer(ref, ref)
+        self.assertEqual(char_error_rate, 0.0)
+
+    def test_cer_5(self):
+        ref = ''
+        hyp = 'Hypothesis'
+        with self.assertRaises(ValueError):
+            char_error_rate = error_rate.cer(ref, hyp)
+
+
+if __name__ == '__main__':
+    unittest.main()
diff --git a/deep_speech_2/train.py b/deep_speech_2/train.py
index 7ac4626f4c..c60a039b69 100644
--- a/deep_speech_2/train.py
+++ b/deep_speech_2/train.py
@@ -12,6 +12,7 @@
 import paddle.v2 as paddle
 from model import deep_speech2
 from data_utils.data import DataGenerator
+import utils
 
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
@@ -51,6 +52,12 @@
     default=True,
     type=distutils.util.strtobool,
     help="Use sortagrad or not. (default: %(default)s)")
+parser.add_argument(
+    "--shuffle_method",
+    default='instance_shuffle',
+    type=str,
+    help="Shuffle method: 'instance_shuffle', 'batch_shuffle', "
+    "'batch_shuffle_clipped'. (default: %(default)s)")
 parser.add_argument(
     "--trainer_count",
     default=4,
@@ -93,9 +100,7 @@
 
 def train():
-    """
-    DeepSpeech2 training.
-    """
+    """DeepSpeech2 training."""
     # initialize data generator
     def data_generator():
@@ -143,13 +148,15 @@ def data_generator():
     train_batch_reader = train_generator.batch_reader_creator(
         manifest_path=args.train_manifest_path,
         batch_size=args.batch_size,
+        min_batch_size=args.trainer_count,
         sortagrad=args.use_sortagrad if args.init_model_path is None else False,
-        batch_shuffle=True)
+        shuffle_method=args.shuffle_method)
     test_batch_reader = test_generator.batch_reader_creator(
         manifest_path=args.dev_manifest_path,
         batch_size=args.batch_size,
+        min_batch_size=1,  # must be 1, but will have errors.
         sortagrad=False,
-        batch_shuffle=False)
+        shuffle_method=None)
 
     # create event handler
     def event_handler(event):
@@ -157,11 +164,11 @@ def event_handler(event):
         if isinstance(event, paddle.event.EndIteration):
             cost_sum += event.cost
             cost_counter += 1
-            if event.batch_id % 50 == 0:
-                print("\nPass: %d, Batch: %d, TrainCost: %f" %
-                      (event.pass_id, event.batch_id, cost_sum / cost_counter))
+            if (event.batch_id + 1) % 100 == 0:
+                print("\nPass: %d, Batch: %d, TrainCost: %f" % (
+                    event.pass_id, event.batch_id + 1, cost_sum / cost_counter))
                 cost_sum, cost_counter = 0.0, 0
-                with gzip.open("params_tmp.tar.gz", 'w') as f:
+                with gzip.open("params.tar.gz", 'w') as f:
                     parameters.to_tar(f)
         else:
             sys.stdout.write('.')
@@ -184,6 +191,7 @@ def event_handler(event):
 
 def main():
+    utils.print_arguments(args)
     paddle.init(use_gpu=args.use_gpu, trainer_count=args.trainer_count)
     train()
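To make the relationship between the new `--shuffle_method` flag and `DataGenerator.batch_reader_creator` explicit, a hedged sketch of the call (the manifest path and batch size are placeholders; `train_generator` is a `DataGenerator` as in `train.py`):

```python
# Sketch: wiring --shuffle_method into the batch reader (placeholder values).
train_batch_reader = train_generator.batch_reader_creator(
    manifest_path='manifest.train',      # placeholder path
    batch_size=256,
    sortagrad=True,                      # sort by duration in the first epoch
    shuffle_method='batch_shuffle')      # or 'instance_shuffle',
                                         #    'batch_shuffle_clipped', or None
```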
diff --git a/deep_speech_2/utils.py b/deep_speech_2/utils.py
new file mode 100644
index 0000000000..9ca363c8f5
--- /dev/null
+++ b/deep_speech_2/utils.py
@@ -0,0 +1,25 @@
+"""Contains common utility functions."""
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+
+def print_arguments(args):
+    """Print argparse's arguments.
+
+    Usage:
+
+    .. code-block:: python
+
+        parser = argparse.ArgumentParser()
+        parser.add_argument("name", default="John", type=str, help="User name.")
+        args = parser.parse_args()
+        print_arguments(args)
+
+    :param args: Input argparse.Namespace for printing.
+    :type args: argparse.Namespace
+    """
+    print("----- Configuration Arguments -----")
+    for arg, value in vars(args).iteritems():
+        print("%s: %s" % (arg, value))
+    print("------------------------------------")
diff --git a/image_classification/README.md b/image_classification/README.md
index a0990367ef..94a0a1b70e 100644
--- a/image_classification/README.md
+++ b/image_classification/README.md
@@ -1 +1,223 @@
-TBD
+Image Classification
+=======================
+
+This example shows how to perform image classification in PaddlePaddle with the AlexNet, VGG, GoogLeNet and ResNet models. For a description of the image classification problem and an introduction to the four models, see the [PaddlePaddle book](https://github.com/PaddlePaddle/book/tree/develop/03.image_classification).
+
+## Training a model
+
+### Initialization
+
+First import the required packages and initialize PaddlePaddle.
+
+```python
+import gzip
+import paddle.v2.dataset.flowers as flowers
+import paddle.v2 as paddle
+import reader
+import vgg
+import resnet
+import alexnet
+import googlenet
+
+
+# PaddlePaddle init
+paddle.init(use_gpu=False, trainer_count=1)
+```
+
+### Defining parameters and inputs
+
+Set the algorithm parameters (such as the data dimension, the number of classes and the batch size), and define the data input layer `image` and the label `lbl`.
+
+```python
+DATA_DIM = 3 * 224 * 224
+CLASS_DIM = 102
+BATCH_SIZE = 128
+
+image = paddle.layer.data(
+    name="image", type=paddle.data_type.dense_vector(DATA_DIM))
+lbl = paddle.layer.data(
+    name="label", type=paddle.data_type.integer_value(CLASS_DIM))
+```
+
+### Obtaining the model
+
+Any one of AlexNet, VGG, GoogLeNet and ResNet can be chosen for the classification task. Calling the corresponding function returns the final softmax layer of the network.
+
+1. AlexNet
+
+Given the input layer `image` and the number of classes `CLASS_DIM`, the softmax layer of AlexNet is obtained with:
+
+```python
+out = alexnet.alexnet(image, class_dim=CLASS_DIM)
+```
+
+2. VGG
+
+Depending on the number of layers, VGG comes as VGG13, VGG16 and VGG19. VGG16 is obtained with:
+
+```python
+out = vgg.vgg16(image, class_dim=CLASS_DIM)
+```
+
+Similarly, VGG13 and VGG19 are available through `vgg.vgg13` and `vgg.vgg19`.
+
+3. GoogLeNet
+
+During training, GoogLeNet uses two auxiliary classifiers to strengthen the gradient signal and provide additional regularization, so `googlenet.googlenet` returns three softmax layers:
+
+```python
+out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM)
+loss1 = paddle.layer.cross_entropy_cost(
+    input=out1, label=lbl, coeff=0.3)
+paddle.evaluator.classification_error(input=out1, label=lbl)
+loss2 = paddle.layer.cross_entropy_cost(
+    input=out2, label=lbl, coeff=0.3)
+paddle.evaluator.classification_error(input=out2, label=lbl)
+extra_layers = [loss1, loss2]
+```
+
+For the two auxiliary outputs, a loss is computed and the classification error is evaluated separately; the losses are then passed to the SGD trainer below as `extra_layers`.
+
+4. ResNet
+
+ResNet is obtained with:
+
+```python
+out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM)
+```
+
+### Defining the loss function
+
+```python
+cost = paddle.layer.classification_cost(input=out, label=lbl)
+```
+
+### Creating parameters and the optimizer
+
+```python
+# Create parameters
+parameters = paddle.parameters.create(cost)
+
+# Create optimizer
+optimizer = paddle.optimizer.Momentum(
+    momentum=0.9,
+    regularization=paddle.optimizer.L2Regularization(rate=0.0005 *
+                                                     BATCH_SIZE),
+    learning_rate=0.001 / BATCH_SIZE,
+    learning_rate_decay_a=0.1,
+    learning_rate_decay_b=128000 * 35,
+    learning_rate_schedule="discexp", )
+```
+
+The learning-rate schedule is specified via `learning_rate_decay_a` (short: $a$), `learning_rate_decay_b` (short: $b$) and `learning_rate_schedule`. Here a discrete exponential ("discexp") schedule is used. With $n$ the cumulative number of processed samples and $lr_{0}$ the configured `learning_rate`, the learning rate is computed as:
+
+$$ lr = lr_{0} * a^ {\lfloor \frac{n}{ b}\rfloor} $$
+
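A tiny numeric sketch of this schedule (standalone Python, with the values copied from the snippet above):

```python
# Discrete exponential ("discexp") schedule, standalone numeric sketch.
BATCH_SIZE = 128
lr0 = 0.001 / BATCH_SIZE
a, b = 0.1, 128000 * 35

def discexp_lr(n):
    """Learning rate after n processed samples."""
    return lr0 * a ** (n // b)

print(discexp_lr(0))             # lr0: constant for the first b samples
print(discexp_lr(128000 * 35))   # lr0 * 0.1: drops by 10x every b samples
```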
+### Defining the data readers
+
+We first use the [flowers dataset](http://www.robots.ox.ac.uk/~vgg/data/flowers/102/index.html) as an example of how to define the input. The following code defines the readers for the flowers training and validation sets:
+
+```python
+train_reader = paddle.batch(
+    paddle.reader.shuffle(
+        flowers.train(),
+        buf_size=1000),
+    batch_size=BATCH_SIZE)
+test_reader = paddle.batch(
+    flowers.valid(),
+    batch_size=BATCH_SIZE)
+```
+
+To use other data, an image list file has to be created first. `reader.py` defines how such files are read: it parses the image path and the class label out of each line.
+
+An image list file is a plain-text file in which every line consists of an image path and a class label, separated by a tab character. Class labels are integers, with 0 as the smallest value. A sample fragment of an image list file:
+
+```
+dataset_100/train_images/n03982430_23191.jpeg 1
+dataset_100/train_images/n04461696_23653.jpeg 7
+dataset_100/train_images/n02441942_3170.jpeg 8
+dataset_100/train_images/n03733281_31716.jpeg 2
+dataset_100/train_images/n03424325_240.jpeg 0
+dataset_100/train_images/n02643566_75.jpeg 8
+```
+
+Training requires separate image list files for the training and validation sets. Assuming the two files are `train.list` and `val.list`, the readers are defined as follows:
+
+```python
+train_reader = paddle.batch(
+    paddle.reader.shuffle(
+        reader.train_reader('train.list'),
+        buf_size=1000),
+    batch_size=BATCH_SIZE)
+test_reader = paddle.batch(
+    reader.test_reader('val.list'),
+    batch_size=BATCH_SIZE)
+```
+
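For completeness, a minimal sketch of how such a tab-separated list file might be generated (the paths and labels are taken from the sample fragment above, purely for illustration):

```python
# Sketch: writing a tab-separated image list file (illustrative samples).
samples = [('dataset_100/train_images/n03982430_23191.jpeg', 1),
           ('dataset_100/train_images/n04461696_23653.jpeg', 7)]
with open('train.list', 'w') as f:
    for path, label in samples:
        f.write('%s\t%d\n' % (path, label))
```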
"Label of %s is: %d" % (file_name, result[0]) +``` + +首先从文件中加载训练好的模型(代码里以第10轮迭代的结果为例),然后读取`image_list_file`中的图像。`image_list_file`是一个文本文件,每一行为一个图像路径。代码使用`paddle.infer`判断`image_list_file`中每个图像的类别,并进行输出。 diff --git a/image_classification/alexnet.py b/image_classification/alexnet.py new file mode 100644 index 0000000000..5262a97faf --- /dev/null +++ b/image_classification/alexnet.py @@ -0,0 +1,50 @@ +import paddle.v2 as paddle + +__all__ = ['alexnet'] + + +def alexnet(input, class_dim): + conv1 = paddle.layer.img_conv( + input=input, + filter_size=11, + num_channels=3, + num_filters=96, + stride=4, + padding=1) + cmrnorm1 = paddle.layer.img_cmrnorm( + input=conv1, size=5, scale=0.0001, power=0.75) + pool1 = paddle.layer.img_pool(input=cmrnorm1, pool_size=3, stride=2) + + conv2 = paddle.layer.img_conv( + input=pool1, + filter_size=5, + num_filters=256, + stride=1, + padding=2, + groups=1) + cmrnorm2 = paddle.layer.img_cmrnorm( + input=conv2, size=5, scale=0.0001, power=0.75) + pool2 = paddle.layer.img_pool(input=cmrnorm2, pool_size=3, stride=2) + + pool3 = paddle.networks.img_conv_group( + input=pool2, + pool_size=3, + pool_stride=2, + conv_num_filter=[384, 384, 256], + conv_filter_size=3, + pool_type=paddle.pooling.Max()) + + fc1 = paddle.layer.fc( + input=pool3, + size=4096, + act=paddle.activation.Relu(), + layer_attr=paddle.attr.Extra(drop_rate=0.5)) + fc2 = paddle.layer.fc( + input=fc1, + size=4096, + act=paddle.activation.Relu(), + layer_attr=paddle.attr.Extra(drop_rate=0.5)) + + out = paddle.layer.fc( + input=fc2, size=class_dim, act=paddle.activation.Softmax()) + return out diff --git a/image_classification/googlenet.py b/image_classification/googlenet.py new file mode 100644 index 0000000000..474f948f02 --- /dev/null +++ b/image_classification/googlenet.py @@ -0,0 +1,180 @@ +import paddle.v2 as paddle + +__all__ = ['googlenet'] + + +def inception(name, input, channels, filter1, filter3R, filter3, filter5R, + filter5, proj): + cov1 = paddle.layer.img_conv( + name=name + '_1', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter1, + stride=1, + padding=0) + + cov3r = paddle.layer.img_conv( + name=name + '_3r', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter3R, + stride=1, + padding=0) + cov3 = paddle.layer.img_conv( + name=name + '_3', + input=cov3r, + filter_size=3, + num_filters=filter3, + stride=1, + padding=1) + + cov5r = paddle.layer.img_conv( + name=name + '_5r', + input=input, + filter_size=1, + num_channels=channels, + num_filters=filter5R, + stride=1, + padding=0) + cov5 = paddle.layer.img_conv( + name=name + '_5', + input=cov5r, + filter_size=5, + num_filters=filter5, + stride=1, + padding=2) + + pool1 = paddle.layer.img_pool( + name=name + '_max', + input=input, + pool_size=3, + num_channels=channels, + stride=1, + padding=1) + covprj = paddle.layer.img_conv( + name=name + '_proj', + input=pool1, + filter_size=1, + num_filters=proj, + stride=1, + padding=0) + + cat = paddle.layer.concat(name=name, input=[cov1, cov3, cov5, covprj]) + return cat + + +def googlenet(input, class_dim): + # stage 1 + conv1 = paddle.layer.img_conv( + name="conv1", + input=input, + filter_size=7, + num_channels=3, + num_filters=64, + stride=2, + padding=3) + pool1 = paddle.layer.img_pool( + name="pool1", input=conv1, pool_size=3, num_channels=64, stride=2) + + # stage 2 + conv2_1 = paddle.layer.img_conv( + name="conv2_1", + input=pool1, + filter_size=1, + num_filters=64, + stride=1, + padding=0) + conv2_2 = paddle.layer.img_conv( + name="conv2_2", 
+ input=conv2_1, + filter_size=3, + num_filters=192, + stride=1, + padding=1) + pool2 = paddle.layer.img_pool( + name="pool2", input=conv2_2, pool_size=3, num_channels=192, stride=2) + + # stage 3 + ince3a = inception("ince3a", pool2, 192, 64, 96, 128, 16, 32, 32) + ince3b = inception("ince3b", ince3a, 256, 128, 128, 192, 32, 96, 64) + pool3 = paddle.layer.img_pool( + name="pool3", input=ince3b, num_channels=480, pool_size=3, stride=2) + + # stage 4 + ince4a = inception("ince4a", pool3, 480, 192, 96, 208, 16, 48, 64) + ince4b = inception("ince4b", ince4a, 512, 160, 112, 224, 24, 64, 64) + ince4c = inception("ince4c", ince4b, 512, 128, 128, 256, 24, 64, 64) + ince4d = inception("ince4d", ince4c, 512, 112, 144, 288, 32, 64, 64) + ince4e = inception("ince4e", ince4d, 528, 256, 160, 320, 32, 128, 128) + pool4 = paddle.layer.img_pool( + name="pool4", input=ince4e, num_channels=832, pool_size=3, stride=2) + + # stage 5 + ince5a = inception("ince5a", pool4, 832, 256, 160, 320, 32, 128, 128) + ince5b = inception("ince5b", ince5a, 832, 384, 192, 384, 48, 128, 128) + pool5 = paddle.layer.img_pool( + name="pool5", + input=ince5b, + num_channels=1024, + pool_size=7, + stride=7, + pool_type=paddle.pooling.Avg()) + dropout = paddle.layer.addto( + input=pool5, + layer_attr=paddle.attr.Extra(drop_rate=0.4), + act=paddle.activation.Linear()) + + out = paddle.layer.fc( + input=dropout, size=class_dim, act=paddle.activation.Softmax()) + + # fc for output 1 + pool_o1 = paddle.layer.img_pool( + name="pool_o1", + input=ince4a, + num_channels=512, + pool_size=5, + stride=3, + pool_type=paddle.pooling.Avg()) + conv_o1 = paddle.layer.img_conv( + name="conv_o1", + input=pool_o1, + filter_size=1, + num_filters=128, + stride=1, + padding=0) + fc_o1 = paddle.layer.fc( + name="fc_o1", + input=conv_o1, + size=1024, + layer_attr=paddle.attr.Extra(drop_rate=0.7), + act=paddle.activation.Relu()) + out1 = paddle.layer.fc( + input=fc_o1, size=class_dim, act=paddle.activation.Softmax()) + + # fc for output 2 + pool_o2 = paddle.layer.img_pool( + name="pool_o2", + input=ince4d, + num_channels=528, + pool_size=5, + stride=3, + pool_type=paddle.pooling.Avg()) + conv_o2 = paddle.layer.img_conv( + name="conv_o2", + input=pool_o2, + filter_size=1, + num_filters=128, + stride=1, + padding=0) + fc_o2 = paddle.layer.fc( + name="fc_o2", + input=conv_o2, + size=1024, + layer_attr=paddle.attr.Extra(drop_rate=0.7), + act=paddle.activation.Relu()) + out2 = paddle.layer.fc( + input=fc_o2, size=class_dim, act=paddle.activation.Softmax()) + + return out, out1, out2 diff --git a/image_classification/infer.py b/image_classification/infer.py new file mode 100644 index 0000000000..659c4f2a8e --- /dev/null +++ b/image_classification/infer.py @@ -0,0 +1,68 @@ +import gzip +import paddle.v2 as paddle +import reader +import vgg +import resnet +import alexnet +import googlenet +import argparse +import os +from PIL import Image +import numpy as np + +WIDTH = 224 +HEIGHT = 224 +DATA_DIM = 3 * WIDTH * HEIGHT +CLASS_DIM = 102 + + +def main(): + # parse the argument + parser = argparse.ArgumentParser() + parser.add_argument( + 'data_list', + help='The path of data list file, which consists of one image path per line' + ) + parser.add_argument( + 'model', + help='The model for image classification', + choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet']) + parser.add_argument( + 'params_path', help='The file which stores the parameters') + args = parser.parse_args() + + # PaddlePaddle init + paddle.init(use_gpu=True, trainer_count=1) + + 
image = paddle.layer.data( + name="image", type=paddle.data_type.dense_vector(DATA_DIM)) + + if args.model == 'alexnet': + out = alexnet.alexnet(image, class_dim=CLASS_DIM) + elif args.model == 'vgg13': + out = vgg.vgg13(image, class_dim=CLASS_DIM) + elif args.model == 'vgg16': + out = vgg.vgg16(image, class_dim=CLASS_DIM) + elif args.model == 'vgg19': + out = vgg.vgg19(image, class_dim=CLASS_DIM) + elif args.model == 'resnet': + out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM) + elif args.model == 'googlenet': + out, _, _ = googlenet.googlenet(image, class_dim=CLASS_DIM) + + # load parameters + with gzip.open(args.params_path, 'r') as f: + parameters = paddle.parameters.Parameters.from_tar(f) + + file_list = [line.strip() for line in open(args.data_list)] + test_data = [(paddle.image.load_and_transform(image_file, 256, 224, False) + .flatten().astype('float32'), ) for image_file in file_list] + probs = paddle.infer( + output_layer=out, parameters=parameters, input=test_data) + lab = np.argsort(-probs) + for file_name, result in zip(file_list, lab): + print "Label of %s is: %d" % (file_name, result[0]) + + +if __name__ == '__main__': + main() diff --git a/image_classification/reader.py b/image_classification/reader.py index 4040222728..b6bad1a24c 100644 --- a/image_classification/reader.py +++ b/image_classification/reader.py @@ -1,30 +1,51 @@ import random from paddle.v2.image import load_and_transform +import paddle.v2 as paddle +from multiprocessing import cpu_count -def train_reader(train_list): +def train_mapper(sample): + ''' + map image path to type needed by model input layer for the training set + ''' + img, label = sample + img = paddle.image.load_image(img) + img = paddle.image.simple_transform(img, 256, 224, True) + return img.flatten().astype('float32'), label + + +def test_mapper(sample): + ''' + map image path to type needed by model input layer for the test set + ''' + img, label = sample + img = paddle.image.load_image(img) + img = paddle.image.simple_transform(img, 256, 224, True) + return img.flatten().astype('float32'), label + + +def train_reader(train_list, buffered_size=1024): def reader(): with open(train_list, 'r') as f: lines = [line.strip() for line in f] - random.shuffle(lines) for line in lines: img_path, lab = line.strip().split('\t') - im = load_and_transform(img_path, 256, 224, True) - yield im.flatten().astype('float32'), int(lab) + yield img_path, int(lab) - return reader + return paddle.reader.xmap_readers(train_mapper, reader, + cpu_count(), buffered_size) -def test_reader(test_list): +def test_reader(test_list, buffered_size=1024): def reader(): with open(test_list, 'r') as f: lines = [line.strip() for line in f] for line in lines: img_path, lab = line.strip().split('\t') - im = load_and_transform(img_path, 256, 224, False) - yield im.flatten().astype('float32'), int(lab) + yield img_path, int(lab) - return reader + return paddle.reader.xmap_readers(test_mapper, reader, + cpu_count(), buffered_size) if __name__ == '__main__': diff --git a/image_classification/resnet.py b/image_classification/resnet.py new file mode 100644 index 0000000000..5a9f24322c --- /dev/null +++ b/image_classification/resnet.py @@ -0,0 +1,95 @@ +import paddle.v2 as paddle + +__all__ = ['resnet_imagenet', 'resnet_cifar10'] + + +def conv_bn_layer(input, + ch_out, + filter_size, + stride, + padding, + active_type=paddle.activation.Relu(), + ch_in=None): + tmp = paddle.layer.img_conv( + input=input, + filter_size=filter_size, + num_channels=ch_in, + num_filters=ch_out, + 
stride=stride,
+        padding=padding,
+        act=paddle.activation.Linear(),
+        bias_attr=False)
+    return paddle.layer.batch_norm(input=tmp, act=active_type)
+
+
+def shortcut(input, ch_in, ch_out, stride):
+    if ch_in != ch_out:
+        return conv_bn_layer(input, ch_out, 1, stride, 0,
+                             paddle.activation.Linear())
+    else:
+        return input
+
+
+def basicblock(input, ch_in, ch_out, stride):
+    short = shortcut(input, ch_in, ch_out, stride)
+    conv1 = conv_bn_layer(input, ch_out, 3, stride, 1)
+    conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1, paddle.activation.Linear())
+    return paddle.layer.addto(
+        input=[short, conv2], act=paddle.activation.Relu())
+
+
+def bottleneck(input, ch_in, ch_out, stride):
+    short = shortcut(input, ch_in, ch_out * 4, stride)
+    conv1 = conv_bn_layer(input, ch_out, 1, stride, 0)
+    conv2 = conv_bn_layer(conv1, ch_out, 3, 1, 1)
+    conv3 = conv_bn_layer(conv2, ch_out * 4, 1, 1, 0,
+                          paddle.activation.Linear())
+    return paddle.layer.addto(
+        input=[short, conv3], act=paddle.activation.Relu())
+
+
+def layer_warp(block_func, input, ch_in, ch_out, count, stride):
+    conv = block_func(input, ch_in, ch_out, stride)
+    for i in range(1, count):
+        conv = block_func(conv, ch_out, ch_out, 1)
+    return conv
+
+
+def resnet_imagenet(input, class_dim, depth=50):
+    cfg = {
+        18: ([2, 2, 2, 2], basicblock),  # two basic blocks per stage
+        34: ([3, 4, 6, 3], basicblock),
+        50: ([3, 4, 6, 3], bottleneck),
+        101: ([3, 4, 23, 3], bottleneck),
+        152: ([3, 8, 36, 3], bottleneck)
+    }
+    stages, block_func = cfg[depth]
+    conv1 = conv_bn_layer(
+        input, ch_in=3, ch_out=64, filter_size=7, stride=2, padding=3)
+    pool1 = paddle.layer.img_pool(input=conv1, pool_size=3, stride=2)
+    res1 = layer_warp(block_func, pool1, 64, 64, stages[0], 1)
+    res2 = layer_warp(block_func, res1, 64, 128, stages[1], 2)
+    res3 = layer_warp(block_func, res2, 128, 256, stages[2], 2)
+    res4 = layer_warp(block_func, res3, 256, 512, stages[3], 2)
+    pool2 = paddle.layer.img_pool(
+        input=res4, pool_size=7, stride=1, pool_type=paddle.pooling.Avg())
+    out = paddle.layer.fc(
+        input=pool2, size=class_dim, act=paddle.activation.Softmax())
+    return out
+
+
+def resnet_cifar10(input, class_dim, depth=32):
+    # depth should be one of 20, 32, 44, 56, 110, 1202
+    assert (depth - 2) % 6 == 0
+    n = (depth - 2) / 6
+    nStages = {16, 64, 128}
+    conv1 = conv_bn_layer(
+        input, ch_in=3, ch_out=16, filter_size=3, stride=1, padding=1)
+    res1 = layer_warp(basicblock, conv1, 16, 16, n, 1)
+    res2 = layer_warp(basicblock, res1, 16, 32, n, 2)
+    res3 = layer_warp(basicblock, res2, 32, 64, n, 2)
+    pool = paddle.layer.img_pool(
+        input=res3, pool_size=8, stride=1, pool_type=paddle.pooling.Avg())
+    out = paddle.layer.fc(
+        input=pool, size=class_dim, act=paddle.activation.Softmax())
+    return out
diff --git a/image_classification/train.py b/image_classification/train.py
old mode 100644
new mode 100755
index 5a5e48a386..37f5deec28
--- a/image_classification/train.py
+++ b/image_classification/train.py
@@ -1,26 +1,58 @@
 import gzip
-
+import paddle.v2.dataset.flowers as flowers
 import paddle.v2 as paddle
 import reader
 import vgg
+import resnet
+import alexnet
+import googlenet
+import argparse
 
 DATA_DIM = 3 * 224 * 224
-CLASS_DIM = 1000
+CLASS_DIM = 102
 BATCH_SIZE = 128
 
 
 def main():
+    # parse the argument
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        'model',
+        help='The model for image classification',
+        choices=['alexnet', 'vgg13', 'vgg16', 'vgg19', 'resnet', 'googlenet'])
+    args = parser.parse_args()
 
     # PaddlePaddle init
-    paddle.init(use_gpu=True, trainer_count=4)
+    paddle.init(use_gpu=True,
trainer_count=1) image = paddle.layer.data( name="image", type=paddle.data_type.dense_vector(DATA_DIM)) lbl = paddle.layer.data( name="label", type=paddle.data_type.integer_value(CLASS_DIM)) - net = vgg.vgg13(image) - out = paddle.layer.fc( - input=net, size=CLASS_DIM, act=paddle.activation.Softmax()) + + extra_layers = None + learning_rate = 0.01 + if args.model == 'alexnet': + out = alexnet.alexnet(image, class_dim=CLASS_DIM) + elif args.model == 'vgg13': + out = vgg.vgg13(image, class_dim=CLASS_DIM) + elif args.model == 'vgg16': + out = vgg.vgg16(image, class_dim=CLASS_DIM) + elif args.model == 'vgg19': + out = vgg.vgg19(image, class_dim=CLASS_DIM) + elif args.model == 'resnet': + out = resnet.resnet_imagenet(image, class_dim=CLASS_DIM) + learning_rate = 0.1 + elif args.model == 'googlenet': + out, out1, out2 = googlenet.googlenet(image, class_dim=CLASS_DIM) + loss1 = paddle.layer.cross_entropy_cost( + input=out1, label=lbl, coeff=0.3) + paddle.evaluator.classification_error(input=out1, label=lbl) + loss2 = paddle.layer.cross_entropy_cost( + input=out2, label=lbl, coeff=0.3) + paddle.evaluator.classification_error(input=out2, label=lbl) + extra_layers = [loss1, loss2] + cost = paddle.layer.classification_cost(input=out, label=lbl) # Create parameters @@ -31,16 +63,23 @@ def main(): momentum=0.9, regularization=paddle.optimizer.L2Regularization(rate=0.0005 * BATCH_SIZE), - learning_rate=0.01 / BATCH_SIZE, + learning_rate=learning_rate / BATCH_SIZE, learning_rate_decay_a=0.1, learning_rate_decay_b=128000 * 35, learning_rate_schedule="discexp", ) train_reader = paddle.batch( - paddle.reader.shuffle(reader.train_reader("train.list"), buf_size=1000), + paddle.reader.shuffle( + flowers.train(), + # To use other data, replace the above line with: + # reader.train_reader('train.list'), + buf_size=1000), batch_size=BATCH_SIZE) test_reader = paddle.batch( - reader.test_reader("test.list"), batch_size=BATCH_SIZE) + flowers.valid(), + # To use other data, replace the above line with: + # reader.test_reader('val.list'), + batch_size=BATCH_SIZE) # End batch and end pass event handler def event_handler(event): @@ -57,11 +96,14 @@ def event_handler(event): # Create trainer trainer = paddle.trainer.SGD( - cost=cost, parameters=parameters, update_equation=optimizer) + cost=cost, + parameters=parameters, + update_equation=optimizer, + extra_layers=extra_layers) trainer.train( reader=train_reader, num_passes=200, event_handler=event_handler) if __name__ == '__main__': - main() + main() \ No newline at end of file diff --git a/image_classification/vgg.py b/image_classification/vgg.py index fdcd351919..c6ec79a8d1 100644 --- a/image_classification/vgg.py +++ b/image_classification/vgg.py @@ -3,7 +3,7 @@ __all__ = ['vgg13', 'vgg16', 'vgg19'] -def vgg(input, nums): +def vgg(input, nums, class_dim): def conv_block(input, num_filter, groups, num_channels=None): return paddle.networks.img_conv_group( input=input, @@ -34,19 +34,21 @@ def conv_block(input, num_filter, groups, num_channels=None): size=fc_dim, act=paddle.activation.Relu(), layer_attr=paddle.attr.Extra(drop_rate=0.5)) - return fc2 + out = paddle.layer.fc( + input=fc2, size=class_dim, act=paddle.activation.Softmax()) + return out -def vgg13(input): +def vgg13(input, class_dim): nums = [2, 2, 2, 2, 2] - return vgg(input, nums) + return vgg(input, nums, class_dim) -def vgg16(input): +def vgg16(input, class_dim): nums = [2, 2, 3, 3, 3] - return vgg(input, nums) + return vgg(input, nums, class_dim) -def vgg19(input): +def vgg19(input, class_dim): nums = [2, 
2, 4, 4, 4]
-    return vgg(input, nums)
+    return vgg(input, nums, class_dim)
diff --git a/scheduled_sampling/README.md b/scheduled_sampling/README.md
index a0990367ef..9af4387e12 100644
--- a/scheduled_sampling/README.md
+++ b/scheduled_sampling/README.md
@@ -1 +1,213 @@
-TBD
+# Scheduled Sampling
+
+## 概述
+
+序列生成任务的生成目标是在给定源输入的条件下,最大化目标序列的概率。训练时该模型将目标序列中的真实元素作为解码器每一步的输入,然后最大化下一个元素的概率。生成时上一步解码得到的元素被用作当前的输入,然后生成下一个元素。可见这种情况下训练阶段和生成阶段的解码器输入数据的概率分布并不一致。
+
+Scheduled Sampling\[[1](#参考文献)\]是一种解决训练和生成时输入数据分布不一致的方法。在训练早期该方法主要使用目标序列中的真实元素作为解码器输入,可以将模型从随机初始化的状态快速引导至一个合理的状态。随着训练的进行,该方法会逐渐更多地使用生成的元素作为解码器输入,以解决数据分布不一致的问题。
+
+标准的序列到序列模型中,如果序列前面生成了错误的元素,后面的输入状态将会受到影响,而该误差会随着生成过程不断向后累积。Scheduled Sampling以一定概率将生成的元素作为解码器输入,这样即使前面生成错误,其训练目标仍然是最大化真实目标序列的概率,模型会朝着正确的方向进行训练。因此这种方式增加了模型的容错能力。
+
+## 算法简介
+Scheduled Sampling主要应用在序列到序列模型的训练阶段,而生成阶段则不需要使用。
+
+训练阶段解码器在最大化第$t$个元素概率时,标准序列到序列模型使用上一时刻的真实元素$y_{t-1}$作为输入。设上一时刻生成的元素为$g_{t-1}$,Scheduled Sampling算法会以一定概率使用$g_{t-1}$作为解码器输入。
+
+设当前已经训练到了第$i$个mini-batch,Scheduled Sampling定义了一个概率$\epsilon_i$控制解码器的输入。$\epsilon_i$是一个随着$i$增大而衰减的变量,常见的定义方式有:
+
+ - 线性衰减:$\epsilon_i=max(\epsilon,k-c*i)$,其中$\epsilon$限制$\epsilon_i$的最小值,$k$和$c$控制线性衰减的幅度。
+
+ - 指数衰减:$\epsilon_i=k^i$,其中$0<k<1$,$k$控制衰减的幅度。
+
+ - 反向Sigmoid衰减:$\epsilon_i=k/(k+e^{i/k})$,其中$k>1$,$k$同样控制衰减的幅度。
+
+图1给出了这三种方式的衰减曲线。
+
+<p align="center">
+<img src="img/decay.jpg"><br>
+图1. 线性衰减、指数衰减和反向Sigmoid衰减的衰减曲线
+</p>
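+
+下面用一段示意代码直观展示三种衰减方式的计算(这不是本例源码,其中 k、c、eps 等参数取值均为假设的示例值):
+
+```python
+import math
+
+
+def linear_decay(i, k=1.0, c=1e-6, eps=0.1):
+    # 线性衰减:epsilon_i = max(eps, k - c*i)
+    return max(eps, k - c * i)
+
+
+def exponential_decay(i, k=0.99999):
+    # 指数衰减:epsilon_i = k**i,要求 0 < k < 1
+    return k**i
+
+
+def inverse_sigmoid_decay(i, k=5000.):
+    # 反向Sigmoid衰减:epsilon_i = k / (k + exp(i/k)),要求 k > 1
+    return k / (k + math.exp(i / k))
+
+
+# 观察 epsilon_i 随 mini-batch 序号 i 增大而单调减小
+for i in [0, 200000, 400000, 800000]:
+    print linear_decay(i), exponential_decay(i), inverse_sigmoid_decay(i)
+```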
+ +如图2所示,在解码器的$t$时刻Scheduled Sampling以概率$\epsilon_i$使用上一时刻的真实元素$y_{t-1}$作为解码器输入,以概率$1-\epsilon_i$使用上一时刻生成的元素$g_{t-1}$作为解码器输入。从图1可知随着$i$的增大$\epsilon_i$会不断减小,解码器将不断倾向于使用生成的元素作为输入,训练阶段和生成阶段的数据分布将变得越来越一致。 + +
+<p align="center">
+<img src="img/Scheduled_Sampling.jpg"><br>
+图2. Scheduled Sampling选择不同元素作为解码器输入示意图
+</p>
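+
+下面的示意代码模拟上述采样过程:随着已训练的 mini-batch 数 $i$ 增大、$\epsilon_i$ 减小,生成元素 $g_{t-1}$ 被选作解码器输入的比例逐渐上升(衰减方式与参数均为假设,仅作说明):
+
+```python
+import math
+
+import numpy as np
+
+np.random.seed(0)  # 仅为使结果可复现
+
+
+def inverse_sigmoid_decay(i, k=5000.):
+    # epsilon_i = k / (k + exp(i/k))
+    return k / (k + math.exp(i / k))
+
+
+for i in [0, 20000, 40000, 60000]:
+    eps = inverse_sigmoid_decay(i)
+    # 每个位置以 1 - epsilon_i 的概率取 1,即使用生成的元素 g_{t-1}
+    flags = (np.random.rand(10000) >= eps).astype('int32')
+    print "i=%d, epsilon_i=%.3f, generated ratio=%.3f" % (i, eps, flags.mean())
+```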
+
+## 模型实现
+
+由于Scheduled Sampling是对序列到序列模型的改进,其整体实现框架与序列到序列模型较为相似。为突出本文重点,这里仅介绍与Scheduled Sampling相关的部分,完整的代码见`scheduled_sampling.py`。
+
+首先导入需要的包,并定义控制衰减概率的类`RandomScheduleGenerator`,如下:
+
+```python
+import numpy as np
+import math
+
+
+class RandomScheduleGenerator:
+    """
+    The random sampling rate for scheduled sampling algorithm, which uses a
+    decayed sampling rate.
+
+    """
+    ...
+```
+
+下面将分别定义类`RandomScheduleGenerator`的`__init__`、`getScheduleRate`和`processBatch`三个方法。
+
+`__init__`方法对类进行初始化,其`schedule_type`参数指定了使用哪种衰减方式,可选的方式有`constant`、`linear`、`exponential`和`inverse_sigmoid`。`constant`指对所有的mini-batch使用固定的$\epsilon_i$,`linear`指线性衰减方式,`exponential`表示指数衰减方式,`inverse_sigmoid`表示反向Sigmoid衰减。`__init__`方法的参数`a`和`b`表示衰减方法的参数,需要在验证集上调优。`self.schedule_computers`将衰减方式映射为计算$\epsilon_i$的函数。最后一行根据`schedule_type`将选择的衰减函数赋给`self.schedule_computer`变量。
+
+```python
+    def __init__(self, schedule_type, a, b):
+        """
+        schedule_type: is the type of the decay. It supports constant, linear,
+        exponential, and inverse_sigmoid right now.
+        a: parameter of the decay (MUST BE DOUBLE)
+        b: parameter of the decay (MUST BE DOUBLE)
+        """
+        self.schedule_type = schedule_type
+        self.a = a
+        self.b = b
+        self.data_processed_ = 0
+        self.schedule_computers = {
+            "constant": lambda a, b, d: a,
+            "linear": lambda a, b, d: max(a, 1 - d / b),
+            "exponential": lambda a, b, d: pow(a, d / b),
+            "inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
+        }
+        assert (self.schedule_type in self.schedule_computers)
+        self.schedule_computer = self.schedule_computers[self.schedule_type]
+```
+
+`getScheduleRate`根据衰减函数和已经处理的数据量计算$\epsilon_i$。
+
+```python
+    def getScheduleRate(self):
+        """
+        Get the schedule sampling rate. Usually not needed to be called by the users
+        """
+        return self.schedule_computer(self.a, self.b, self.data_processed_)
+
+```
+
+`processBatch`方法根据概率值$\epsilon_i$进行采样,得到`indexes`,`indexes`中每个元素取值为`0`的概率为$\epsilon_i$,取值为`1`的概率为$1-\epsilon_i$。`indexes`决定了解码器的输入是真实元素还是生成的元素,取值为`0`表示使用真实元素,取值为`1`表示使用生成的元素。
+
+```python
+    def processBatch(self, batch_size):
+        """
+        Get a batch_size of sampled indexes. These indexes can be passed to a
+        MultiplexLayer to select from the ground truth and generated samples
+        from the last time step.
+        """
+        rate = self.getScheduleRate()
+        numbers = np.random.rand(batch_size)
+        indexes = (numbers >= rate).astype('int32').tolist()
+        self.data_processed_ += batch_size
+        return indexes
+```
+
+Scheduled Sampling需要在序列到序列模型的基础上增加一个输入`true_token_flag`,以控制解码器输入。
+
+```python
+true_token_flags = paddle.layer.data(
+    name='true_token_flag',
+    type=paddle.data_type.integer_value_sequence(2))
+```
+
+这里还需要对原始reader进行封装,增加`true_token_flag`的数据生成器。下面以线性衰减为例说明如何调用上面定义的`RandomScheduleGenerator`产生`true_token_flag`的输入数据。
+
+```python
+schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
+
+def gen_schedule_data(reader):
+    """
+    Creates a data reader for scheduled sampling.
+
+    Output from the iterator created by the original reader will be
+    appended with "true_token_flag" to indicate whether to use the true token.
+
+    :param reader: the original reader.
+    :type reader: callable
+
+    :return: the new reader with the field "true_token_flag".
+ :rtype: callable + """ + + def data_reader(): + for src_ids, trg_ids, trg_ids_next in reader(): + yield src_ids, trg_ids, trg_ids_next, \ + [0] + schedule_generator.processBatch(len(trg_ids) - 1) + + return data_reader +``` + +这段代码在原始输入数据(即源序列元素`src_ids`、目标序列元素`trg_ids`和目标序列下一个元素`trg_ids_next`)后追加了控制解码器输入的数据。由于解码器第一个元素是序列开始符,因此将追加的数据第一个元素设置为`0`,表示解码器第一步始终使用真实目标序列的第一个元素(即序列开始符)。 + +训练时`recurrent_group`每一步调用的解码器函数如下: + +```python + def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word, + true_token_flag): + """ + The decoder step for training. + :param enc_vec: the encoder vector for attention + :type enc_vec: LayerOutput + :param enc_proj: the encoder projection for attention + :type enc_proj: LayerOutput + :param true_word: the ground-truth target word + :type true_word: LayerOutput + :param true_token_flag: the flag of using the ground-truth target word + :type true_token_flag: LayerOutput + :return: the softmax output layer + :rtype: LayerOutput + """ + + decoder_mem = paddle.layer.memory( + name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) + + context = paddle.networks.simple_attention( + encoded_sequence=enc_vec, + encoded_proj=enc_proj, + decoder_state=decoder_mem) + + gru_out_memory = paddle.layer.memory( + name='gru_out', size=target_dict_dim) + + generated_word = paddle.layer.max_id(input=gru_out_memory) + + generated_word_emb = paddle.layer.embedding( + input=generated_word, + size=word_vector_dim, + param_attr=paddle.attr.ParamAttr(name='_target_language_embedding')) + + current_word = paddle.layer.multiplex( + input=[true_token_flag, true_word, generated_word_emb]) + + with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs: + decoder_inputs += paddle.layer.full_matrix_projection(input=context) + decoder_inputs += paddle.layer.full_matrix_projection( + input=current_word) + + gru_step = paddle.layer.gru_step( + name='gru_decoder', + input=decoder_inputs, + output_mem=decoder_mem, + size=decoder_size) + + with paddle.layer.mixed( + name='gru_out', + size=target_dict_dim, + bias_attr=True, + act=paddle.activation.Softmax()) as out: + out += paddle.layer.full_matrix_projection(input=gru_step) + + return out +``` + +该函数使用`memory`层`gru_out_memory`记忆上一时刻生成的元素,根据`gru_out_memory`选择概率最大的词语`generated_word`作为生成的词语。`multiplex`层会在真实元素`true_word`和生成的元素`generated_word`之间做出选择,并将选择的结果作为解码器输入。`multiplex`层使用了三个输入,分别为`true_token_flag`、`true_word`和`generated_word_emb`。对于这三个输入中每个元素,若`true_token_flag`中的值为`0`,则`multiplex`层输出`true_word`中的相应元素;若`true_token_flag`中的值为`1`,则`multiplex`层输出`generated_word_emb`中的相应元素。 + +## 参考文献 + +[1] Bengio S, Vinyals O, Jaitly N, et al. [Scheduled sampling for sequence prediction with recurrent neural networks](http://papers.nips.cc/paper/5956-scheduled-sampling-for-sequence-prediction-with-recurrent-neural-networks)//Advances in Neural Information Processing Systems. 2015: 1171-1179. 
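+
+附:下面的小脚本演示经上述封装后 reader 的输出格式,便于理解 `true_token_flag` 字段(其中的样本数据纯属虚构,并假设 `random_schedule_generator.py` 位于当前目录):
+
+```python
+from random_schedule_generator import RandomScheduleGenerator
+
+schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
+
+
+def gen_schedule_data(reader):
+    # 与正文相同的封装:在每条样本末尾追加 true_token_flag 序列
+    def data_reader():
+        for src_ids, trg_ids, trg_ids_next in reader():
+            yield src_ids, trg_ids, trg_ids_next, \
+                [0] + schedule_generator.processBatch(len(trg_ids) - 1)
+
+    return data_reader
+
+
+def toy_reader():
+    # 一条虚构样本:(源词id序列, 目标词id序列, 目标下一词id序列)
+    yield [2, 3, 4, 1], [0, 5, 6], [5, 6, 1]
+
+
+for sample in gen_schedule_data(toy_reader)():
+    print sample  # 末尾多出与目标序列等长、首位恒为 0 的标志序列
+```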
diff --git a/scheduled_sampling/img/Scheduled_Sampling.jpg b/scheduled_sampling/img/Scheduled_Sampling.jpg
new file mode 100644
index 0000000000..27f568a45f
Binary files /dev/null and b/scheduled_sampling/img/Scheduled_Sampling.jpg differ
diff --git a/scheduled_sampling/img/decay.jpg b/scheduled_sampling/img/decay.jpg
new file mode 100644
index 0000000000..ea0532750a
Binary files /dev/null and b/scheduled_sampling/img/decay.jpg differ
diff --git a/scheduled_sampling/random_schedule_generator.py b/scheduled_sampling/random_schedule_generator.py
new file mode 100644
index 0000000000..7569eaffc2
--- /dev/null
+++ b/scheduled_sampling/random_schedule_generator.py
@@ -0,0 +1,47 @@
+import numpy as np
+import math
+
+
+class RandomScheduleGenerator:
+    """
+    The random sampling rate for scheduled sampling algorithm, which uses a
+    decayed sampling rate.
+    """
+
+    def __init__(self, schedule_type, a, b):
+        """
+        schedule_type: is the type of the decay. It supports constant, linear,
+        exponential, and inverse_sigmoid right now.
+        a: parameter of the decay (MUST BE DOUBLE)
+        b: parameter of the decay (MUST BE DOUBLE)
+        """
+        self.schedule_type = schedule_type
+        self.a = a
+        self.b = b
+        self.data_processed_ = 0
+        self.schedule_computers = {
+            "constant": lambda a, b, d: a,
+            "linear": lambda a, b, d: max(a, 1 - d / b),
+            "exponential": lambda a, b, d: pow(a, d / b),
+            "inverse_sigmoid": lambda a, b, d: b / (b + math.exp(d * a / b)),
+        }
+        assert (self.schedule_type in self.schedule_computers)
+        self.schedule_computer = self.schedule_computers[self.schedule_type]
+
+    def getScheduleRate(self):
+        """
+        Get the schedule sampling rate. Usually not needed to be called by the users
+        """
+        return self.schedule_computer(self.a, self.b, self.data_processed_)
+
+    def processBatch(self, batch_size):
+        """
+        Get a batch_size of sampled indexes. These indexes can be passed to a
+        MultiplexLayer to select from the ground truth and generated samples
+        from the last time step.
+        """
+        rate = self.getScheduleRate()
+        numbers = np.random.rand(batch_size)
+        indexes = (numbers >= rate).astype('int32').tolist()
+        self.data_processed_ += batch_size
+        return indexes
diff --git a/scheduled_sampling/scheduled_sampling.py b/scheduled_sampling/scheduled_sampling.py
new file mode 100644
index 0000000000..e2e328ea6a
--- /dev/null
+++ b/scheduled_sampling/scheduled_sampling.py
@@ -0,0 +1,324 @@
+import sys
+import gzip
+import paddle.v2 as paddle
+from random_schedule_generator import RandomScheduleGenerator
+
+schedule_generator = RandomScheduleGenerator("linear", 0.75, 1000000)
+
+
+def gen_schedule_data(reader):
+    """
+    Creates a data reader for scheduled sampling.
+
+    Output from the iterator created by the original reader will be
+    appended with "true_token_flag" to indicate whether to use the true token.
+
+    :param reader: the original reader.
+    :type reader: callable
+
+    :return: the new reader with the field "true_token_flag".
+ :rtype: callable + """ + + def data_reader(): + for src_ids, trg_ids, trg_ids_next in reader(): + yield src_ids, trg_ids, trg_ids_next, \ + [0] + schedule_generator.processBatch(len(trg_ids) - 1) + + return data_reader + + +def seqToseq_net(source_dict_dim, target_dict_dim, is_generating=False): + """ + The definition of the sequence to sequence model + :param source_dict_dim: the dictionary size of the source language + :type source_dict_dim: int + :param target_dict_dim: the dictionary size of the target language + :type target_dict_dim: int + :param is_generating: whether in generating mode + :type is_generating: Bool + :return: the last layer of the network + :rtype: LayerOutput + """ + ### Network Architecture + word_vector_dim = 512 # dimension of word vector + decoder_size = 512 # dimension of hidden unit in GRU Decoder network + encoder_size = 512 # dimension of hidden unit in GRU Encoder network + + beam_size = 3 + max_length = 250 + + #### Encoder + src_word_id = paddle.layer.data( + name='source_language_word', + type=paddle.data_type.integer_value_sequence(source_dict_dim)) + src_embedding = paddle.layer.embedding( + input=src_word_id, size=word_vector_dim) + src_forward = paddle.networks.simple_gru( + input=src_embedding, size=encoder_size) + src_backward = paddle.networks.simple_gru( + input=src_embedding, size=encoder_size, reverse=True) + encoded_vector = paddle.layer.concat(input=[src_forward, src_backward]) + + #### Decoder + with paddle.layer.mixed(size=decoder_size) as encoded_proj: + encoded_proj += paddle.layer.full_matrix_projection( + input=encoded_vector) + + backward_first = paddle.layer.first_seq(input=src_backward) + + with paddle.layer.mixed( + size=decoder_size, act=paddle.activation.Tanh()) as decoder_boot: + decoder_boot += paddle.layer.full_matrix_projection( + input=backward_first) + + def gru_decoder_with_attention_train(enc_vec, enc_proj, true_word, + true_token_flag): + """ + The decoder step for training. 
+ :param enc_vec: the encoder vector for attention + :type enc_vec: LayerOutput + :param enc_proj: the encoder projection for attention + :type enc_proj: LayerOutput + :param true_word: the ground-truth target word + :type true_word: LayerOutput + :param true_token_flag: the flag of using the ground-truth target word + :type true_token_flag: LayerOutput + :return: the softmax output layer + :rtype: LayerOutput + """ + + decoder_mem = paddle.layer.memory( + name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) + + context = paddle.networks.simple_attention( + encoded_sequence=enc_vec, + encoded_proj=enc_proj, + decoder_state=decoder_mem) + + gru_out_memory = paddle.layer.memory( + name='gru_out', size=target_dict_dim) + + generated_word = paddle.layer.max_id(input=gru_out_memory) + + generated_word_emb = paddle.layer.embedding( + input=generated_word, + size=word_vector_dim, + param_attr=paddle.attr.ParamAttr(name='_target_language_embedding')) + + current_word = paddle.layer.multiplex( + input=[true_token_flag, true_word, generated_word_emb]) + + with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs: + decoder_inputs += paddle.layer.full_matrix_projection(input=context) + decoder_inputs += paddle.layer.full_matrix_projection( + input=current_word) + + gru_step = paddle.layer.gru_step( + name='gru_decoder', + input=decoder_inputs, + output_mem=decoder_mem, + size=decoder_size) + + with paddle.layer.mixed( + name='gru_out', + size=target_dict_dim, + bias_attr=True, + act=paddle.activation.Softmax()) as out: + out += paddle.layer.full_matrix_projection(input=gru_step) + + return out + + def gru_decoder_with_attention_test(enc_vec, enc_proj, current_word): + """ + The decoder step for generating. + :param enc_vec: the encoder vector for attention + :type enc_vec: LayerOutput + :param enc_proj: the encoder projection for attention + :type enc_proj: LayerOutput + :param current_word: the previously generated word + :type current_word: LayerOutput + :return: the softmax output layer + :rtype: LayerOutput + """ + + decoder_mem = paddle.layer.memory( + name='gru_decoder', size=decoder_size, boot_layer=decoder_boot) + + context = paddle.networks.simple_attention( + encoded_sequence=enc_vec, + encoded_proj=enc_proj, + decoder_state=decoder_mem) + + with paddle.layer.mixed(size=decoder_size * 3) as decoder_inputs: + decoder_inputs += paddle.layer.full_matrix_projection(input=context) + decoder_inputs += paddle.layer.full_matrix_projection( + input=current_word) + + gru_step = paddle.layer.gru_step( + name='gru_decoder', + input=decoder_inputs, + output_mem=decoder_mem, + size=decoder_size) + + with paddle.layer.mixed( + size=target_dict_dim, + bias_attr=True, + act=paddle.activation.Softmax()) as out: + out += paddle.layer.full_matrix_projection(input=gru_step) + return out + + decoder_group_name = "decoder_group" + group_input1 = paddle.layer.StaticInput(input=encoded_vector, is_seq=True) + group_input2 = paddle.layer.StaticInput(input=encoded_proj, is_seq=True) + group_inputs = [group_input1, group_input2] + + if not is_generating: + trg_embedding = paddle.layer.embedding( + input=paddle.layer.data( + name='target_language_word', + type=paddle.data_type.integer_value_sequence(target_dict_dim)), + size=word_vector_dim, + param_attr=paddle.attr.ParamAttr(name='_target_language_embedding')) + group_inputs.append(trg_embedding) + + true_token_flags = paddle.layer.data( + name='true_token_flag', + type=paddle.data_type.integer_value_sequence(2)) + 
group_inputs.append(true_token_flags)
+
+        decoder = paddle.layer.recurrent_group(
+            name=decoder_group_name,
+            step=gru_decoder_with_attention_train,
+            input=group_inputs)
+
+        lbl = paddle.layer.data(
+            name='target_language_next_word',
+            type=paddle.data_type.integer_value_sequence(target_dict_dim))
+        cost = paddle.layer.classification_cost(input=decoder, label=lbl)
+
+        return cost
+    else:
+        trg_embedding = paddle.layer.GeneratedInput(
+            size=target_dict_dim,
+            embedding_name='_target_language_embedding',
+            embedding_size=word_vector_dim)
+        group_inputs.append(trg_embedding)
+
+        beam_gen = paddle.layer.beam_search(
+            name=decoder_group_name,
+            step=gru_decoder_with_attention_test,
+            input=group_inputs,
+            bos_id=0,
+            eos_id=1,
+            beam_size=beam_size,
+            max_length=max_length)
+
+        return beam_gen
+
+
+def main():
+    paddle.init(use_gpu=False, trainer_count=1)
+    is_generating = False
+    model_path_for_generating = 'params_pass_1.tar.gz'
+
+    # source and target dict dim.
+    dict_size = 30000
+    source_dict_dim = target_dict_dim = dict_size
+
+    # train the network
+    if not is_generating:
+        cost = seqToseq_net(source_dict_dim, target_dict_dim)
+        parameters = paddle.parameters.create(cost)
+
+        # define optimize method and trainer
+        optimizer = paddle.optimizer.Adam(
+            learning_rate=5e-5,
+            regularization=paddle.optimizer.L2Regularization(rate=8e-4))
+        trainer = paddle.trainer.SGD(
+            cost=cost, parameters=parameters, update_equation=optimizer)
+        # define data reader
+        wmt14_reader = paddle.batch(
+            gen_schedule_data(
+                paddle.reader.shuffle(
+                    paddle.dataset.wmt14.train(dict_size), buf_size=8192)),
+            batch_size=5)
+
+        feeding = {
+            'source_language_word': 0,
+            'target_language_word': 1,
+            'target_language_next_word': 2,
+            'true_token_flag': 3
+        }
+
+        # define event_handler callback
+        def event_handler(event):
+            if isinstance(event, paddle.event.EndIteration):
+                if event.batch_id % 10 == 0:
+                    print "\nPass %d, Batch %d, Cost %f, %s" % (
+                        event.pass_id, event.batch_id, event.cost,
+                        event.metrics)
+                else:
+                    sys.stdout.write('.')
+                    sys.stdout.flush()
+            if isinstance(event, paddle.event.EndPass):
+                # save parameters
+                with gzip.open('params_pass_%d.tar.gz' % event.pass_id,
+                               'w') as f:
+                    parameters.to_tar(f)
+
+        # start to train
+        trainer.train(
+            reader=wmt14_reader,
+            event_handler=event_handler,
+            feeding=feeding,
+            num_passes=2)
+
+    # generate French translations for the English source sequences
+    else:
+        # use the first 3 samples for generation
+        gen_creator = paddle.dataset.wmt14.gen(dict_size)
+        gen_data = []
+        gen_num = 3
+        for item in gen_creator():
+            gen_data.append((item[0], ))
+            if len(gen_data) == gen_num:
+                break
+
+        beam_gen = seqToseq_net(source_dict_dim, target_dict_dim, is_generating)
+        # get the trained model
+        with gzip.open(model_path_for_generating, 'r') as f:
+            parameters = paddle.parameters.Parameters.from_tar(f)
+        # prob is the prediction probabilities, and id is the prediction word.
+ beam_result = paddle.infer( + output_layer=beam_gen, + parameters=parameters, + input=gen_data, + field=['prob', 'id']) + + # get the dictionary + src_dict, trg_dict = paddle.dataset.wmt14.get_dict(dict_size) + + # the delimited element of generated sequences is -1, + # the first element of each generated sequence is the sequence length + seq_list = [] + seq = [] + for w in beam_result[1]: + if w != -1: + seq.append(w) + else: + seq_list.append(' '.join([trg_dict.get(w) for w in seq[1:]])) + seq = [] + + prob = beam_result[0] + beam_size = 3 + for i in xrange(gen_num): + print "\n*******************************************************\n" + print "src:", ' '.join( + [src_dict.get(w) for w in gen_data[i][0]]), "\n" + for j in xrange(beam_size): + print "prob = %f:" % (prob[i][j]), seq_list[i * beam_size + j] + + +if __name__ == '__main__': + main() diff --git a/sequence_tagging_for_ner/README.md b/sequence_tagging_for_ner/README.md index fadfe4564a..0f635a5290 100644 --- a/sequence_tagging_for_ner/README.md +++ b/sequence_tagging_for_ner/README.md @@ -1,23 +1,40 @@ # 命名实体识别 +以下是本例的简要目录结构及说明: + +```text +. +├── data # 存储运行本例所依赖的数据 +│   ├── download.sh +├── images # README 文档中的图片 +├── index.html +├── infer.py # 测试脚本 +├── network_conf.py # 模型定义 +├── reader.py # 数据读取接口 +├── README.md # 文档 +├── train.py # 训练脚本 +└── utils.py # 定义同样的函数 +``` + + +## 简介 + 命名实体识别(Named Entity Recognition,NER)又称作“专名识别”,是指识别文本中具有特定意义的实体,主要包括人名、地名、机构名、专有名词等,是自然语言处理研究的一个基础问题。NER任务通常包括实体边界识别、确定实体类别两部分,可以将其作为序列标注问题解决。 -序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO方式](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例: +序列标注可以分为Sequence Classification、Segment Classification和Temporal Classification三类[[1](#参考文献)],本例只考虑Segment Classification,即对输入序列中的每个元素在输出序列中给出对应的标签。对于NER任务,由于需要标识边界,一般采用[BIO标注方法](http://book.paddlepaddle.org/07.label_semantic_roles/)定义的标签集,如下是一个NER的标注结果示例:
<div align="center">
图1. BIO标注方法示例
</div>
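
作为补充,下面用一小段示意代码演示如何从 BIO 标注中直接恢复实体边界和类别(示意实现,样例句子取自下文的数据示例):

```python
def bio_decode(words, tags):
    """从 BIO 标签中抽取 (实体, 类别) 的示意实现。"""
    entities = []
    for word, tag in zip(words, tags):
        if tag.startswith('B-'):
            entities.append(([word], tag[2:]))  # B 标记开启一个新实体
        elif tag.startswith('I-') and entities:
            entities[-1][0].append(word)  # I 标记延续当前实体
    return [(' '.join(ws), t) for ws, t in entities]


words = ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']
tags = ['B-ORG', 'O', 'B-PER', 'O', 'O', 'B-LOC', 'O']
print bio_decode(words, tags)
# [('U.N.', 'ORG'), ('Ekeus', 'PER'), ('Baghdad', 'LOC')]
```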
-根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务同样可通过序列标注来解决。
-
-由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN等序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。
+根据序列标注结果可以直接得到实体边界和实体类别。类似的,分词、词性标注、语块识别、[语义角色标注](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等任务都可通过序列标注来解决。使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。

-使用神经网络模型解决问题的思路通常是:前层网络学习输入的特征表示,网络的最后一层在特征基础上完成最终的任务;对于序列标注问题,通常:使用基于RNN的网络结构学习特征,将学习到的特征接入CRF完成序列标注。实际上是将传统CRF中的线性模型换成了非线性神经网络。沿用CRF的出发点是:CRF使用句子级别的似然概率,能够更好的解决标记偏置问题[[2](#参考文献)]。本例也将基于此思路建立模型。虽然,这里以NER任务作为示例,但所给出的模型可以应用到其他各种序列标注任务中。
+由于序列标注问题的广泛性,产生了[CRF](http://book.paddlepaddle.org/07.label_semantic_roles/index.cn.html)等经典的序列模型,这些模型大多只能使用局部信息或需要人工设计特征。随着深度学习研究的发展,循环神经网络(Recurrent Neural Network,RNN)等序列模型能够处理序列元素之间前后关联问题,能够从原始输入文本中学习特征表示,而更加适合序列标注任务,更多相关知识可参考PaddleBook中[语义角色标注](https://github.com/PaddlePaddle/book/blob/develop/07.label_semantic_roles/README.cn.md)一课。

-## 模型说明
+## 模型详解

-NER任务的输入是"一句话",目标是识别句子中的实体边界及类别,我们参照论文\[[2](#参考文献)\]仅对原始句子进行了一些预处理工作:将每个词转换为小写,并将原词是否大写另作为一个特征,共同作为模型的输入。按照上述处理序列标注问题的思路,可构造如下结构的模型(图2是模型结构示意图):
+NER任务的输入是"一句话",目标是识别句子中的实体边界及类别,我们参照论文\[[2](#参考文献)\]仅对原始句子进行了一些简单的预处理工作:将每个词转换为小写,并将原词是否大写另作为一个特征,共同作为模型的输入。模型如图2所示,工作流程如下:

 1. 构造输入
  - 输入1是句子序列,采用one-hot方式表示
@@ -28,51 +45,47 @@
<div align="center">
-图2. NER模型的网络结构图
+图2. NER 模型网络结构图
</div>
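
在进入数据说明之前,先用几行示意代码说明上述"转小写 + 大写标记"的输入构造方式(样例句子取自下文的数据示例):

```python
sentence = ['U.N.', 'official', 'Ekeus', 'heads', 'for', 'Baghdad', '.']

lower_words = [w.lower() for w in sentence]  # 输入1:转为小写的句子序列
cap_marks = [int(w[0].isupper()) for w in sentence]  # 输入2:原词是否大写

print lower_words  # ['u.n.', 'official', 'ekeus', 'heads', 'for', 'baghdad', '.']
print cap_marks  # [1, 0, 1, 0, 0, 1, 0]
```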
## 数据说明 -在本例中,我们使用CoNLL 2003 NER任务中开放出的数据集。该任务(见[此页面](http://www.clips.uantwerpen.be/conll2003/ner/))只提供了标注工具的下载,原始Reuters数据由于版权原因需另外申请免费下载。在获取原始数据后可参照标注工具中README生成所需数据文件,完成后将包括如下三个数据文件: +在本例中,我们以 [CoNLL 2003 NER任务](http://www.clips.uantwerpen.be/conll2003/ner/)为例,原始Reuters数据由于版权原因需另外申请免费下载,请大家按照原网站说明获取。 -| 文件名 | 描述 | -|---|---| -| eng.train | 训练数据 | -| eng.testa | 验证数据,可用来进行参数调优 | -| eng.testb | 评估数据,用来进行最终效果评估 | ++ 我们仅在`data`目录下的`train`和`test`文件中放置少数样本用以示例输入数据格式。 ++ 本例依赖数据还包括 + 1. 输入文本的词典 + 2. 为词典中的词语提供预训练好的词向量 + 2. 标记标签的词典 + 标记标签词典已附在`data`目录中,对应于`data/target.txt`文件。输入文本的词典以及词典中词语的预训练的词向量来自:[Stanford CS224d](http://cs224d.stanford.edu/)课程作业。**为运行本例,请首先在`data`目录下运行`download.sh`脚本下载输入文本的词典和预训练的词向量。** 完成后会将这两个文件一并放入`data`目录下,输入文本的词典和预训练的词向量分别对应:`data/vocab.txt`和`data/wordVectors.txt`这两个文件。 -为保证本例的完整性,我们从中抽取少量样本放在`data/train`和`data/test`文件中,作为示例使用;由于版权原因,完整数据还请大家自行获取。这三个文件数据格式如下: +CoNLL 2003原始数据格式如下: ``` - U.N. NNP I-NP I-ORG - official NN I-NP O - Ekeus NNP I-NP I-PER - heads VBZ I-VP O - for IN I-PP O - Baghdad NNP I-NP I-LOC - . . O O +U.N. NNP I-NP I-ORG +official NN I-NP O +Ekeus NNP I-NP I-PER +heads VBZ I-VP O +for IN I-PP O +Baghdad NNP I-NP I-LOC +. . O O ``` -其中第一列为原始句子序列(第二、三列分别为词性标签和句法分析中的语块标签,这里暂时不用),第四列为采用了I-TYPE方式表示的NER标签(I-TYPE和BIO方式的主要区别在于语块开始标记的使用上,I-TYPE只有在出现相邻的同类别实体时对后者使用B标记,其他均使用I标记),句子之间以空行分隔。 +- 第一列为原始句子序列 +- 第二、三列分别为词性标签和句法分析中的语块标签,本例不使用 +- 第四列为采用了 I-TYPE 方式表示的NER标签 + - I-TYPE 和 BIO 方式的主要区别在于语块开始标记的使用上,I-TYPE只有在出现相邻的同类别实体时对后者使用B标记,其他均使用I标记),句子之间以空行分隔。 -原始数据需要进行数据预处理才能被PaddlePaddle处理,预处理主要包括下面几个步骤: +我们在`reader.py`脚本中完成对原始数据的处理以及读取,主要包括下面几个步骤: 1. 从原始数据文件中抽取出句子和标签,构造句子序列和标签序列; -2. 将I-TYPE表示的标签转换为BIO方式表示的标签; +2. 将 I-TYPE 表示的标签转换为 BIO 方式表示的标签; 3. 将句子序列中的单词转换为小写,并构造大写标记序列; 4. 依据词典获取词对应的整数索引。 -我们将在`conll03.py`中完成以上预处理工作(使用方法将在后文给出): -```python -# import conll03 -# conll03.corpus_reader函数完成上面第1步和第2步. -# conll03.reader_creator函数完成上面第3步和第4步. -# conll03.train和conll03.test函数可以获取处理之后的每条样本来供PaddlePaddle训练和测试. -``` - -预处理完成后,一条训练样本包含3个部分:句子序列、首字母大写标记序列、标注序列。下表是一条训练样本的示例。 +预处理完成后,一条训练样本包含3个部分作为神经网络的输入信息用于训练:(1)句子序列;(2)首字母大写标记序列;(3)标注序列,下表是一条训练样本的示例: | 句子序列 | 大写标记序列 | 标注序列 | |---|---|---| @@ -84,165 +97,65 @@ NER任务的输入是"一句话",目标是识别句子中的实体边界及类 | baghdad | 1 | B-LOC | | . | 0 | O | -另外,本例依赖的数据还包括:word词典、label词典和预训练的词向量三个文件。label词典已附在`data`目录中,对应于`data/target.txt`;word词典和预训练的词向量来源于[Stanford CS224d](http://cs224d.stanford.edu/)课程作业,请先在该示例所在目录下运行`data/download.sh`脚本进行下载,完成后会将这两个文件一并放入`data`目录下,分别对应`data/vocab.txt`和`data/wordVectors.txt`。 - -## 使用说明 - -本示例给出的`conll03.py`和`ner.py`两个Python脚本分别提供了数据相关和模型相关接口。 - -### 数据接口使用 - -`conll03.py`提供了使用CoNLL 2003数据的接口,各主要函数的功能已在数据说明部分进行说明。结合我们提供的接口和文件,可以按照如下步骤使用CoNLL 2003数据: +## 运行 +### 编写数据读取接口 -1. 定义各数据文件、词典文件和词向量文件路径; -2. 调用`conll03.train`和`conll03.test`接口。 +自定义数据读取接口只需编写一个 Python 生成器实现从原始输入文本中解析一条训练样本的逻辑。[reader.py](./reader.py) 中的`data_reader`函数实现了读取原始数据返回类型为: `paddle.data_type.integer_value_sequence`的 3 个输入(分别对应:词语在字典的序号、是否为大写、标注结果在字典中的序号)给`network_conf.ner_net`中定义的 3 个 `data_layer` 的功能。 -对应如下代码: +### 训练 -```python -import conll03 +1. 运行 `sh data/download.sh` +2. 
修改 `train.py` 的 `main` 函数,指定数据路径 -# 修改以下变量为对应文件路径 -train_data_file = 'data/train' # 训练数据文件的路径 -test_data_file = 'data/test' # 测试数据文件的路径 -vocab_file = 'data/vocab.txt' # 输入句子对应的字典文件的路径 -target_file = 'data/target.txt' # 标签对应的字典文件的路径 -emb_file = 'data/wordVectors.txt' # 预训练的词向量参数的路径 - -# 返回训练数据的生成器 -train_data_reader = conll03.train(train_data_file, vocab_file, target_file) -# 返回测试数据的生成器 -test_data_reader = conll03.test(test_data_file, vocab_file, target_file) -``` - -### 模型接口使用 - -`ner.py`提供了以下两个接口分别进行模型训练和预测: - -1. `ner_net_train(data_reader, num_passes)`函数实现了模型训练功能,参数`data_reader`表示训练数据的迭代器、`num_passes`表示训练pass的轮数。训练过程中每100个iteration会打印模型训练信息。我们同时在模型配置中加入了chunk evaluator,会输出当前模型对语块识别的Precision、Recall和F1值。chunk evaluator 的详细使用说明请参照[文档](http://www.paddlepaddle.org/develop/doc/api/v2/config/evaluators.html#chunk)。每个pass后会将模型保存为`params_pass_***.tar.gz`的文件(`***`表示pass的id)。 - -2. `ner_net_infer(data_reader, model_file)`函数实现了预测功能,参数`data_reader`表示测试数据的迭代器、`model_file`表示保存在本地的模型文件,预测过程会按如下格式打印预测结果: - - ``` - U.N. B-ORG - official O - Ekeus B-PER - heads O - for O - Baghdad B-LOC - . O + ```python + main( + train_data_file='data/train', + test_data_file='data/test', + vocab_file='data/vocab.txt', + target_file='data/target.txt', + emb_file='data/wordVectors.txt') ``` - 其中第一列为原始句子序列,第二列为BIO方式表示的NER标签。 - -### 运行程序 - -本例另在`ner.py`中提供了完整的运行流程,包括数据接口的使用和模型训练、预测。根据上文所述的接口使用方法,使用时需要将`ner.py`中如下的数据设置部分中的各变量修改为对应文件路径: -```python -# 修改以下变量为对应文件路径 -train_data_file = 'data/train' # 训练数据文件的路径 -test_data_file = 'data/test' # 测试数据文件的路径 -vocab_file = 'data/vocab.txt' # 输入句子对应的字典文件的路径 -target_file = 'data/target.txt' # 标签对应的字典文件的路径 -emb_file = 'data/wordVectors.txt' # 预训练的词向量参数的路径 -``` - -各接口的调用已在`ner.py`中提供: - -```python -# 训练数据的生成器 -train_data_reader = conll03.train(train_data_file, vocab_file, target_file) -# 测试数据的生成器 -test_data_reader = conll03.test(test_data_file, vocab_file, target_file) - -# 模型训练 -ner_net_train(data_reader=train_data_reader, num_passes=1) -# 预测 -ner_net_infer(data_reader=test_data_reader, model_file='params_pass_0.tar.gz') -``` - -为运行序列标注模型除适当调整`num_passes`和`model_file`两参数值外,无需再做其它修改(也可根据需要自行调用各接口,如只使用预测功能)。完成修改后,运行本示例只需在`ner.py`所在路径下执行`python ner.py`即可。该示例程序会执行数据读取、模型训练和保存、模型读取及新样本预测等步骤。 - -### 自定义数据和任务 - -前文提到本例中的模型可以应用到其他序列标注任务中,这里以词性标注任务为例,给出使用其他数据,并应用到其他任务的操作方法。 - -假定有如下格式的原始数据: - -``` -U.N. NNP -official NN -Ekeus NNP -heads VBZ -for IN -Baghdad NNP -. . -``` - -第一列为原始句子序列,第二列为词性标签序列,两列之间以“\t”分隔,句子之间以空行分隔。 - -为使用PaddlePaddle和本示例提供的模型,可参照`conll03.py`并根据需要自定义数据接口,如下: +3. 运行命令 `python train.py` ,**需要注意:直接运行使用的是示例数据,请替换真实的标记数据。** -1. 参照`conll03.py`中的`corpus_reader`函数,定义接口返回句子序列和标签序列生成器; - - ```python - # 实现句子和对应标签的抽取,传入数据文件路径,返回句子和标签序列生成器。 - def corpus_reader(filename): - def reader(): - sentence = [] - labels = [] - with open(filename) as f: - for line in f: - if len(line.strip()) == 0: - if len(sentence) > 0: - yield sentence, labels - sentence = [] - labels = [] - else: - segs = line.strip().split() - sentence.append(segs[0]) - labels.append(segs[-1]) - f.close() - - return reader + ```text + commandline: --use_gpu=False --trainer_count=1 + Initing parameters.. + Init parameters done. + Pass 0, Batch 0, Cost 41.430110, {'ner_chunk.precision': 0.01587301678955555, 'ner_chunk.F1-score': 0.028368793427944183, 'ner_chunk.recall': 0.13333334028720856, 'error': 0.939393937587738} + Test with Pass 0, Batch 0, {'ner_chunk.precision': 0.0, 'ner_chunk.F1-score': 0.0, 'ner_chunk.recall': 0.0, 'error': 0.16260161995887756} ``` -2. 
参照`conll03.py`中的`reader_creator`函数,定义接口返回id化的句子和标签序列生成器。 +### 预测 +1. 修改 [infer.py](./infer.py) 的 `main` 函数,指定:需要测试的模型的路径、测试数据、字典文件,预测标记文件的路径,默认参数如下: ```python - # 传入corpus_reader返回的生成器、dict类型的word词典和label词典,返回id化的句子和标签序列生成器。 - def reader_creator(corpus_reader, word_dict, label_dict): - def reader(): - for sentence, labels in corpus_reader(): - word_idx = [ - word_dict.get(w, UNK_IDX) # 若使用小写单词,请使用w.lower() - for w in sentence - ] - # 若使用首字母大写标记,请去掉以下注释符号,并在yield语句的word_idx后加上mark - # mark = [ - # 1 if w[0].isupper() else 0 - # for w in sentence - # ] - label_idx = [label_dict.get(w) for w in labels] - yield word_idx, label_idx, sentence # 加上sentence方便预测时打印 - return reader + infer( + model_path="models/params_pass_0.tar.gz", + batch_size=2, + test_data_file="data/test", + vocab_file="data/vocab.txt", + target_file="data/target.txt") ``` -自定义了数据接口后,要使用本示例中的模型,只需在调用模型训练和预测接口`ner_net_train`和`ner_net_infer`时传入调用`reader_creator`返回的生成器即可。另外需要注意,这里给出的数据接口定义去掉了`conll03.py`一些预处理(使用原始句子,而非转换成小写单词加上大写标记),`ner.py`中的模型相关接口也需要进行一些调整: - -1. 修改网络结构定义接口`ner_net`中大写标记相关内容: +2. 在终端运行 `python infer.py`,开始测试,会看到如下预测结果: + + ```text + cricket O + - O + leicestershire O + take O + over O + at O + top O + after O + innings O + victory O + . O - 删去`mark`和`mark_embedding`两个变量; - -2. 修改模型训练接口`ner_net_train`中大写标记相关内容: - - 将变量`feeding`定义改为`feeding = {'word': 0, 'target': 1}`; - -3. 修改预测接口`ner_net_infer`中大写标记相关内容: - - 将`test_data.append([item[0], item[1]])`改为`test_data.append([item[0]])`。 + ``` + 输出分为两列,以“\t” 分隔,第一列是输入的词语,第二列是标记结果。多条输入序列之间以空行分隔。 -如果要继续使用NER中的特征预处理(小写单词、大写标记),请参照上文`reader_creator`代码段给出的注释进行修改,此时`ner.py`中的模型相关接口不必进行修改。 ## 参考文献 diff --git a/sequence_tagging_for_ner/data/download.sh b/sequence_tagging_for_ner/data/download.sh index 9782c63b35..fc9de3d7f2 100644 --- a/sequence_tagging_for_ner/data/download.sh +++ b/sequence_tagging_for_ner/data/download.sh @@ -1,6 +1,12 @@ wget http://cs224d.stanford.edu/assignment2/assignment2.zip -unzip assignment2.zip -cp assignment2_release/data/ner/wordVectors.txt ./ -cp assignment2_release/data/ner/vocab.txt ./ -rm -rf assignment2.zip assignment2_release + +if [ $? -eq 0 ];then + unzip assignment2.zip + cp assignment2_release/data/ner/wordVectors.txt ./data + cp assignment2_release/data/ner/vocab.txt ./data + rm -rf assignment2.zip assignment2_release +else + echo "download data error!" >> /dev/stderr + exit 1 +fi diff --git a/sequence_tagging_for_ner/index.html b/sequence_tagging_for_ner/index.html index b7c6c8994a..b36eea0116 100644 --- a/sequence_tagging_for_ner/index.html +++ b/sequence_tagging_for_ner/index.html @@ -42,24 +42,41 @@